FOREWORD


The research presented in this report was conducted under Project
METTEST (Methodological Issues in Criterion-Referenced Testing), in the
Unit Training and Evaluation Systems (UTES) Technical Area of ARI under
Army RDTE Project 2Q62722A764. The goal of Project METTEST is to provide
quantitative methods for evaluating unit proficiency. The means for
achieving this goal include basic research in test construction
methodology, measurement and scaling models, and decisionmaking
implications of test score interpretation.

Related, ongoing programs within the UTES Technical Area include
evaluation of small combat units under simulated battlefield conditions
(REALTRAIN, ARTEP), qualification of tank crews and platoon gunnery
(I DOC), and improvement of the reliability of ARTEP evaluation.

Anticipated future research under Project METTEST includes the
development of a computer model for performance evaluation, and
development of measurement, scaling, scoring, decisionmaking, and quality
control models for use in performance evaluations when
criterion-referenced testing procedures are employed.

ARI research in this area is conducted as an in-house research effort
augmented by contracts with organizations selected as having unique
capabilities and facilities for research in a specific area. The present
study was conducted in collaboration with personnel of the University
of Maryland under Contract No. DAHC19-75-M-0003.
CRITERION-REFERENCED TESTING: A CRITICAL ANALYSIS OF SELECTED MODELS


BRIEF 


Requirement:

To develop a theoretical base for research and eventual application
of methods for assigning pass-fail scores in personnel and unit
evaluation using the criterion-referenced testing approach.


Procedure:

Relevant literature for each of five approaches to
criterion-referenced testing was reviewed. The approaches were compared
on the basis of the following: assumptions and rationale, the interactive
effects of test length and passing criteria on classification accuracy,
and areas of applicability. A computational example was prepared for
each model, and strengths and weaknesses were also evaluated.

Findings:

Four of the five models were able to specify an "optimal" test
length and cutoff score, although they differed as to the parameter
estimates required from the test developer. For example, expert
"prior" information can be used to reduce test length. Each of the
models also provides an estimate for misclassifications, or Type I and
Type II errors. The models are neither redundant nor interchangeable.
No "best" method was identified. Rather, the selection of a model
depends upon the particular measurement requirements and constraints as
identified by the test developer.


Utilization of findings:

This research provides qualitative and quantitative guidelines for
developers of criterion-referenced tests. The models have been applied
to analyze data from the handgun qualification course at the U.S. Army
Military Police School. Application of the models has also been
addressed to revision of Table VIII tank gunnery.
CRITERION-REFERENCED TESTING: A CRITICAL
ANALYSIS OF SELECTED MODELS


INTRODUCTION 

Scoring and decisionmaking models for criterion-referenced testing
deal with two questions of practical and theoretical importance: (1)
how much test information should be collected to provide a basis for
confident decisions about the mastery or nonmastery of trained skills;
and (2) what are the methods of establishing statistically valid
standards of achievement. Criterion-referenced testing (CRT) requires that
the data provide information about performance capabilities measured
against some external criterion (Glaser & Nitko, 1971; Carver, 1974).
Such criteria are properly derived from an analysis of the requirements
for performing specific tasks successfully.

Measurement of mastery implies that CRT's should represent the skills
to be measured with high fidelity. However, serious constraints are
imposed by requiring high fidelity: (1) the time needed to administer
the test may be more than is readily available; (2) the number of
examiners needed to administer the test and collect data may be excessive;
(3) the expenditure of materials used in testing may be prohibitively
high; and (4) the appropriate testing materials or apparatus may not
be available for a long enough time. These constraints place a premium
upon limiting test data to the minimum amount sufficient for the desired
quality of decisionmaking. Statistical models offer one means of
accomplishing this goal.

Two problems arise in establishing achievement standards on CRT's.
The first is related to the congruence between CRT performance and
real-world requirements. The second is related to the statistical
inferences applied to observed CRT scores.

Before any statistical model can be used in a CRT situation, the
requirements for mastery over the domain in general must be specified.
The requirements usually describe the capabilities of persons who can
successfully perform the tasks included in the domain. Glaser and
Klaus (1963) suggest that "proficiency standards can be established
at any value between the point where the system will not perform at
all and the point where any further contribution from the human
component will not yield any increase in system performance" (p. 424).

These system requirements may include the human performance
components of industrial-vocational tasks, minimal competencies in an
educational system, or basic literacy skills. System requirements
may also reflect manpower needs, the criticality of the task, or the
consequences of poor performance. Such idealized standards must then
be converted to standards on a particular CRT. The conversion process
involves issues of test validity which are beyond the scope of this
paper. Meskauskas (1976) discusses several methods that have been used
to bridge the gap between operational tests and real-world requirements.

If the CRT includes the entire full fidelity task, such as
disassembling and cleaning a particular piece of machinery, then setting
mastery standards is relatively clear and unambiguous. However, if the
CRT includes only a sample of the full fidelity task, or if fidelity is
decreased for practical purposes, then mastery standards for the CRT
are not clearcut. Heretofore, the use of arbitrary cutoff scores has
kept this problem at a manageable level. For example, objectives often
include a statement of standards requiring a certain minimum percent
correct for attainment of mastery status. Two criticisms can be
directed at this concept of mastery.

First, any percentage correct is a relative standard. The
definition of mastery has been shown (Millman, 1972; Novick & Lewis, 1974;
Epstein & Steinheiser, 1975) to be a function both of the percentage
correct and of the number of trials or items that comprise the test.
A more comprehensive definition could be based either upon (1) an
idealization, such as the proportion of correct answers of all possible test
items, or (2) the position on an underlying continuum of ability
hypothesized to determine an examinee's score on a given test. By stating
standards in terms of such an idealization or ability continuum, it is
possible to explicitly define mastery cutoff scores for any test length.
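
The dependence of a percentage standard on test length can be illustrated with a minimal sketch. It assumes, as the binomial models reviewed later in this paper do, that an examinee answers each item independently with a constant probability of success. The function name `pass_probability` and the example value of .70 for the examinee's true proportion are illustrative assumptions, not values taken from the cited papers.

```python
from math import ceil, comb


def pass_probability(true_p: float, n_items: int, pct_required: float) -> float:
    """Probability of meeting a percentage-correct standard on an n-item test.

    Assumes independent items with a constant success probability true_p
    (a simple binomial model; an illustrative assumption).
    """
    # Smallest number of correct answers that satisfies the standard.
    # The small offset guards against floating-point round-up.
    needed = ceil(pct_required * n_items - 1e-9)
    return sum(
        comb(n_items, k) * true_p**k * (1 - true_p) ** (n_items - k)
        for k in range(needed, n_items + 1)
    )


if __name__ == "__main__":
    # Same examinee, same 80% standard, three different test lengths.
    for n in (5, 10, 20):
        print(n, round(pass_probability(0.70, n, 0.80), 4))
```

Under these assumptions, the same examinee facing the same 80% standard is less likely to pass as the test lengthens, which is the sense in which a percentage correct alone does not define mastery.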

The second criticism refers to the level of ability required for
mastery. For example, why should one standard (such as 80% correct)
be set rather than another (such as 70% or 90%)? Perhaps this question
could be answered by empirical studies showing the relationship between
CRT scores and the transfer or retention of training. The required
level of mastery could also be determined by system requirements,
criticality, and similar factors.

Each of the models discussed in this paper, with the exception of
Block's (1972) approach to setting standards empirically, assumes that
a well-defined universe of items exists or can be generated. The authors
also assume that the role of the statistical model is to describe
accurately an examinee with respect to that universe. The validity of the
generalization from the universe of items to the real world is not
investigated. The models further assume that a mastery standard relative
to the entire universe can be established. Given these assumptions,
the problem is how to interpret the observations. The following section
discusses theoretical issues which may produce possible solutions. Table
1 then introduces and summarizes the specific models.

The problem of setting standards arises because it is often
impractical to insist upon complete mastery of a task, or even to require a
very high percentage of correct answers to the items comprising a CRT.
Furthermore, it is often impossible to list all of the potential items
of a given task domain. For example, an indefinitely large number of
multiplication items could comprise an item universe from which a
sample of items is selected. An arbitrary standard would determine that
the examinee answering a specified number (or percentage) of the
sample correctly will be classified as a "master" of multiplication. The
main purpose of the present paper is to evaluate several mathematical
models that claim to reduce the arbitrariness in setting criteria for
mastery on tests representing a sample of the test-item universe. The
motivation for developing models by which criteria for mastery can be
derived formally arises from the goal of trying to minimize
misclassifications (i.e., designating a "true master" as a "nonmaster" or vice
versa). The more complex the skills assessed by the CRT, the smaller
the sample of items, and the more varied the type of performance
included in the universe, the greater the danger of misclassification.

Theoretical  Problems  for  CRT  Models 

Nature  of  Performance  Acquisition.  Is  the  attainment  of  mastery 
an  "all-or-none"  occurrence,  or  is  there  a continuum  of  varying  degrees 
of  skill  acquisition?  The  widely  accepted  dichotomy  of  master  vs. 
nonmaster  may  be  overly  simplistic.  The  alternative  is  a continuum  of 
varying  degrees  of  mastery.  Both  dichotomous  and  continuous  CRT  models 
are  available  in  the  literature. 

Measurement Error. One type of error, similar to the classical
psychometric notion of measurement error, refers to random inappropriate
responses due to temporary environmental distractions, lucky guesses,
lapses in attention, etc. The magnitude of such error can be estimated
and included in the estimation of actual ability and in the
determination of test standards and lengths.

A second type, "classification" error, refers to the (usually)
dichotomous classification of an examinee as a master or nonmaster.
Its magnitude and direction are primarily a function of how a cutoff
score is chosen. Classification error will tend to increase as the
accuracy in estimating actual ability decreases, but a mathematically
defined relationship between measurement error and classification error
has not been derived (Guilford, 1956, pp. 380-384).

Test Length to Distinguish Masters from Nonmasters. One technique
to improve ability estimation and reduce the chance for
misclassification is to increase the number of test items. In some situations this
may be possible simply by repeating items until the desired level of
precision is attained. However, in most cases, test length cannot be
indefinitely increased. Therefore, a statistical model that provides
increased information per item is highly desirable. Generally, a CRT
model should provide sufficient information to decisionmakers so that
they will know the risks of committing false positive and false
negative errors before the test is conducted.
Overview of Selected CRT Models


The CRT models discussed in this paper were chosen to illustrate
the diversity in approaches to the problems outlined in the
preceding section. Methods developed by Crehan (1974) and Block (1972)
are basically empirical in that cutoff scores are based upon empirically
derived requirements. Models derived by Emrick (1971) and by Macready and
Dayton (1976) assume a dichotomous definition of mastery and analytically
describe procedures for establishing cutoff scores. Kriewall (1969) and
Millman (1972, 1974) assume that responses to test items and examinee
ability can be described by the family of binomial distributions. Their
basic models can be extended by applying the theory of binomial error
models (Lord & Novick, 1968). Novick and Lewis (1974) discuss the
application of a Bayesian approach to CRT issues. A one-parameter
logistic model (Rasch, 1960; Wright, 1967) provides a practical example of
how latent trait theory may be applied to CRT data analysis. Finally,
an approach for CRT data analysis derived from classical regression
theory is discussed. Each model is examined in terms of rationale and
assumptions, empirical support and applications, illustrative examples
of the type of input required and output provided, and critical
evaluation.


REVIEW  OF  MODELS 


Block 


Block's (1972) research provides an experimental approach to
setting mastery standards. He studied the relationship between the level
of performance required on each unit of a three-unit instructional
sequence and five cognitive and affective outcome variables. The
rationale for this study was the intuitive notion that maximum performance on
an external measure of achievement would be observed in students having
the most stringent passing requirements in the instruction. A second
question concerned the relationship between scores on an affective
measure of interest and attitude and passing requirements in instruction.

Block's experiment included four treatment groups that differed
from one instructional unit to the next with respect to the standard
required for advancement. If the student did not meet the standard
(65%, 75%, 85%, or 95% of the items correct on a 20-item test),
remedial instruction was provided. Students in a control group proceeded
from one unit to the next with no remediation, regardless of their test
score. Five outcome variables were defined: achievement, learning rate,
transfer, interest, and attitude.

Transfer  was  measured  by  a 10-item  test  which  required  the  use  of 
the  learned  skills  to  solve  a novel  set  of  problems.  It  was  given  both 
as  a pretest  and  after  instruction.  Interest  and  attitude  were  measured 
using  a 24-item  questionnaire. 


Most of the results supported the intuitive hypothesis. The
control group did consistently worse on achievement, transfer, and
retention than any of the experimental groups, and the learning curves
suggested that high standards early in an instructional sequence may produce
increased efficiency later in the sequence. However, several interesting
exceptions to the intuitive expectations suggest that higher standards
are not always better standards. For example, the 85% and 95% groups
did not differ from one another on retention or achievement measures,
although they both differed from the control group. Only the 85% group
produced sustained high levels of interest and attitude.

Block's research suggests that a unitary definition of an "optimum"
CRT cutting score may be questionable. If uniformly high achievement
and transfer are required at the possible expense of positive interest
and attitude, it may be that the highest mastery standard should be used.
However, if some "mix" of cognitive and affective outcomes is desired,
then a lower standard seems appropriate.

Similar studies could be conducted on a wide range of instructional
programs for a wide variety of outcomes. The results could lead to
usable and meaningful guidelines for setting cutting scores to optimize
a number of instructional outcomes. Because the results may not be
generalizable across content areas and instructional programs, such an
optimization strategy would require costly and extensive research. This
empirical verification of a decisionmaking strategy for finding optimal
mixes of cognitive and affective outcomes does not mathematically model
any of the problems outlined in the previous section of this paper. A
truly complete scoring and decisionmaking CRT model would take into
account both the psychological variables that characterize optimum
learning and the constraints imposed by test length, cutting scores, and
misclassification rates.


Crehan 

A method used by Crehan (1974) also relies heavily on a training
context for its interpretation. The method's rationale for specifying
cutting scores is based upon the comparison of the test scores of
students who have completed training with the test scores of those who have
not yet received training. This method provides a means of assessing
the proportion of misclassified students within each group when various
cutting scores are used.


Correct classification occurs when posttraining students pass the
test and students with no training fail the test. Using a 2 x 2 matrix
of pass-fail and training-no training for each cutting score, the
proportion of correct classifications Pc can be obtained as follows:

Pc = [number who had training and passed + number who had no
training and failed] ÷ [sum of all four entries in the matrix].

A cutting score is found by choosing the score that maximizes the
proportion of correct classifications.


For example, assume that the distribution of scores on a five-item
CRT for an untrained group and a group that has completed training is
as follows:

Number Correct     No Training     Completed Training

      0                10                  0
      1                 5                  0
      2                 4                  1
      3                 0                  5
      4                 1                 10
      5                 0                  4

A series of fourfold tables in Table 2 displays the relationships
between cutting score, pass-fail decisions, and the amount of training.
Pc, the proportion of correct classifications, is calculated for each
fourfold table. The highest value of Pc in this example is found when
three correct responses are used as the cutting score. Therefore, for
this training program, a cutting score of 3 would be recommended as the
optimal cutting score.
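
The search over cutting scores can be sketched in a few lines of Python. The sketch below uses the score distributions from the example above (reading the second row as a score of 1 obtained by 5 untrained and 0 trained examinees) and treats an examinee as passing when the score is at or above the cutting score. The function names are illustrative and are not taken from Crehan's paper.

```python
# Score distributions from the five-item example: score -> number of examinees.
NO_TRAINING = {0: 10, 1: 5, 2: 4, 3: 0, 4: 1, 5: 0}
COMPLETED = {0: 0, 1: 0, 2: 1, 3: 5, 4: 10, 5: 4}


def proportion_correct(cut, untrained, trained):
    """Pc for one cutting score: (trained who pass + untrained who fail) / total."""
    trained_pass = sum(n for score, n in trained.items() if score >= cut)
    untrained_fail = sum(n for score, n in untrained.items() if score < cut)
    total = sum(untrained.values()) + sum(trained.values())
    return (trained_pass + untrained_fail) / total


def best_cut(untrained, trained):
    """Cutting score with the largest Pc (the lowest such score if tied)."""
    cuts = sorted(set(untrained) | set(trained))
    return max(cuts, key=lambda c: proportion_correct(c, untrained, trained))


if __name__ == "__main__":
    for cut in range(6):
        print(cut, proportion_correct(cut, NO_TRAINING, COMPLETED))
    print("recommended cutting score:", best_cut(NO_TRAINING, COMPLETED))
```

For these data the sketch reproduces the Pc values of Table 2 (20/40 through 24/40) and selects a cutting score of 3.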


The major strength of this procedure is that it provides an
estimate of the optimal cutting score for differentiating between trained
and untrained groups while remaining relatively simple to implement.
However, these two groups do not necessarily correspond to the
categories of "masters" and "nonmasters" in terms of the ability of group
members to complete an objective. Instead, one might expect the
posttraining group to perform less well than a group consisting entirely of
examinees who have mastered the objective, and the pretraining group to
perform somewhat better than a group of examinees, none of whom has
mastered the objective.

The simplicity of Crehan's procedure is partially offset by a
number of weaknesses, including the following: (1) lack of a procedure for
estimating the minimum item sample size necessary to keep the probability
of misclassification at or below some specified level; and (2) lack of
statistical criteria for differentiating between Pc's which "seem" to
be similar (or different).

Table 2

Example Data Matrices for the Crehan Procedure

                          Training experience
Cutting                  No         Completed
score                 training      training       Pc

  0        Pass          20            20
           Fail           0             0          20/40

  1        Pass          10            20
           Fail          10             0          30/40

  2        Pass           5            20
           Fail          15             0          35/40

  3        Pass           1            19
           Fail          19             1          38/40

  4        Pass           1            14
           Fail          19             6          33/40

  5        Pass           0             4
           Fail          20            16          24/40 = .60


Macready and Dayton - Emrick

Assumptions and Rationale. Two related probabilistic models that
provide probability estimates of the 2^n possible response patterns on
a dichotomously scored, n-item test are discussed in this section
(Emrick, 1971; Dayton & Macready, 1976; and Macready & Dayton, 1975).
Both models assume that all examinees belong to one of two possible
"true score types" for any given domain: masters (M) and nonmasters
(M̄). Masters are those individuals who have acquired the necessary
skills to respond correctly to all items within the domain. Thus for
a three-item test with items sampled from the domain of interest, a
master's true score response pattern would be 111, where a "one"
indicates a correct response to an item. Conversely, nonmasters have not
acquired the necessary skills to respond correctly to any item within
the domain; thus their true score response pattern would be 000, where
a "zero" indicates an incorrect response to an item. This dichotomous
classification of individuals appears reasonable to the degree that all
items within a domain involve the same skill.

In general, it is assumed that the only way that any non-true score
response pattern can occur is for a nonmaster to make one or more
correct "guessing" errors or for a master to make one or more "forgetting"
errors. For the first model (Macready & Dayton, 1975), the error
probabilities are unrestricted except for the usual 0, 1 bounds for
probabilities. a_i and b_i represent the probabilities of a "guessing" and
"forgetting" error, respectively, for item i. Furthermore, P(M) and
P(M̄) represent the proportions of examinees who are masters and
nonmasters, respectively, with the usual restrictions: 0 < P(M) < 1 and
P(M) + P(M̄) = 1. If local independence among responses is assumed,
then the probability of the jth observed response pattern on an n-item
test is

$$
P(j) = P(j \mid \bar{M})\,P(\bar{M}) + P(j \mid M)\,P(M)
     = \left[\prod_{i=1}^{n} a_i^{x_{ij}} (1 - a_i)^{1 - x_{ij}}\right] P(\bar{M})
     + \left[\prod_{i=1}^{n} b_i^{1 - x_{ij}} (1 - b_i)^{x_{ij}}\right] P(M),
\tag{1}
$$

where x_ij (equal to 0 or 1) is the score of the ith item for the jth
response pattern. Maximum likelihood estimates of these parameters are
obtained from test data by means of the Newton-Raphson iteration
procedure (Rao, 1965, pp. 302-309).


Because of the relatively large number of parameters (2n + 1) under
this first model, there are circumstances in which it is desirable to
utilize a second model (Dayton & Macready, 1976) based on a more
restrictive set of assumptions: guessing errors for all items are equal
(a_i = a), and forgetting errors for all items are equal (b_i = b). These
restrictions result in a simplification of the formula defining the
probability of the occurrence of the jth response pattern on an n-item
test to

$$
P(j) = P(j \mid \bar{M})\,P(\bar{M}) + P(j \mid M)\,P(M)
     = a^{s_j} (1 - a)^{n - s_j}\, P(\bar{M})
     + b^{n - s_j} (1 - b)^{s_j}\, P(M),
\tag{2}
$$

where s_j is the number of correct responses (i.e., number of 1's) in the
response pattern.

Macready and Dayton provide a discussion of how these models can be
used for making classification decisions with respect to mastery of
specific concepts or skills, and they provide several examples. The
discussion includes the development of procedures for (1) assessing the
adequacy of "fit" provided by the models, (2) identifying optimal
decision rules for mastery classification that incorporate utility functions
related to costs of false negatives and false positives, and (3)
identifying minimally sufficient numbers of items necessary to obtain
acceptable levels of misclassification.

Example. For the case of a three-item test, there are eight
possible response patterns: (000), (001), (010), (100), (110), (101), (011),
(111). For the first model, the 2n + 1 necessary parameters correspond
to guessing (a_i) and forgetting (b_i) parameters for each item and the
proportion of subjects in the examinee group who are masters. Maximum
likelihood estimates of these parameters are obtained from the test
data.


For purposes of example for Model I, assume the following parameter
values: a_1 = .01, b_1 = .20; a_2 = .05, b_2 = .10; a_3 = .10, b_3 = .05; and
P(M) = P(M̄) = .5. This might correspond to a test in which the items
appeared to be growing increasingly easy. For the second model, only
three parameters are found: a, b, and P(M). Again for purposes of
example for Model II, assume that the obtained estimates for the
parameters are a = .06, b = .12, P(M) = P(M̄) = .5.

To find the probability of observing each response pattern in a
given examinee group, the probability of observing each response pattern
given mastery status must be multiplied by the proportion of the group
in that mastery status. For this example, each conditional pattern
probability is multiplied by P(M) = P(M̄) = .5. Table 3 shows the results
of these calculations.
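
The calculation just described can be sketched as follows. The sketch evaluates the two terms of Equation (1) for every response pattern of a three-item test, using the example parameter values for Model I and Model II with P(M) = .5. The function name `pattern_parts` and the dictionary layout are illustrative choices, not notation from Macready and Dayton.

```python
from itertools import product

# Example parameter values from the text (a: guessing, b: forgetting).
MODEL_I = {"a": [0.01, 0.05, 0.10], "b": [0.20, 0.10, 0.05]}
MODEL_II = {"a": [0.06] * 3, "b": [0.12] * 3}


def pattern_parts(pattern, a, b, p_master=0.5):
    """Return the (master, nonmaster) terms of Equation (1) for one pattern.

    pattern is a tuple of 0/1 item scores; local independence is assumed.
    """
    master = p_master
    nonmaster = 1.0 - p_master
    for x, a_i, b_i in zip(pattern, a, b):
        # A master misses an item only by a forgetting error.
        master *= (1 - b_i) if x else b_i
        # A nonmaster gets an item right only by a guessing error.
        nonmaster *= a_i if x else (1 - a_i)
    return master, nonmaster


if __name__ == "__main__":
    for pattern in product((0, 1), repeat=3):
        m1, n1 = pattern_parts(pattern, **MODEL_I)
        m2, n2 = pattern_parts(pattern, **MODEL_II)
        label = "".join(map(str, pattern))
        print(f"{label}  {m1:.6f}  {n1:.6f}  {m2:.6f}  {n2:.6f}")
```

Summing the master and nonmaster terms over all eight patterns gives 1 for either model, which is a quick check that the parameter values have been entered consistently.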

Table 3

Probability of Observing Response Patterns Under the
Macready and Dayton Models, Assuming P(M) = P(M̄) = .5

                     Model I                      Model II
Response       P(response pattern)          P(response pattern)
pattern       Master      Nonmaster        Master       Nonmaster

  000         .0005       .423225          .000864      .415292
  001         .0095       .047025          .006336      .026508
  010         .0045       .022275          .006336      .026508
  100         .0020       .004275          .006336 c    .026508 d
  110         .0180 a     .000225 b        .046464      .001692
  101         .0380       .000475          .046464      .001692
  011         .0855       .002475          .046464      .001692
  111         .3420       .000025          .340736      .000108

            P(M) = .5    P(M̄) = .5       P(M) = .5    P(M̄) = .5

a P(M) = (.20^0 × .80^1)(.10^0 × .90^1)(.05^1 × .95^0) × .5 = .0180.
b P(M̄) = (.01^1 × .99^0)(.05^1 × .95^0)(.10^0 × .90^1) × .5 = .000225.
c P(M) = .12^2 × .88^1 × .5 = .006336.
d P(M̄) = .06^1 × .94^2 × .5 = .026508.

The mastery/nonmastery decision rule is based on the score that
minimizes the probability of misclassification. Probability of
misclassification is defined as the probability that a master will not
achieve the cutting score times the proportion of masters in the group
plus the probability that a nonmaster will equal or exceed it times the
proportion of nonmasters in the group. The probabilities for both models
and all possible cutting scores are given in Table 4.

The final column of Table 4 indicates that for both models the
optimal cutting score is 2 correct. Note that although the cutting score
is the same for both models, the probability of misclassification under
the richer Model I is smaller than under Model II at every intermediate
cutting score (1, 2, and 3).
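
The decision rule can be sketched in Python under the example parameter values used above. For each cutting score, the sketch accumulates the joint probability that a master scores below the cut (false negative) and that a nonmaster scores at or above it (false positive), and then selects the cut with the smallest total. The function names are illustrative and are not taken from Macready and Dayton.

```python
from itertools import product

# Example parameter values from the text (a: guessing, b: forgetting).
MODEL_I = {"a": [0.01, 0.05, 0.10], "b": [0.20, 0.10, 0.05]}
MODEL_II = {"a": [0.06] * 3, "b": [0.12] * 3}


def misclassification(cut, a, b, p_master=0.5):
    """Return (false negative, false positive, total) for 'pass if score >= cut'."""
    false_neg = false_pos = 0.0
    for pattern in product((0, 1), repeat=len(a)):
        master, nonmaster = p_master, 1.0 - p_master
        for x, a_i, b_i in zip(pattern, a, b):
            master *= (1 - b_i) if x else b_i
            nonmaster *= a_i if x else (1 - a_i)
        if sum(pattern) >= cut:
            false_pos += nonmaster  # a nonmaster reaches the cutting score
        else:
            false_neg += master  # a master falls short of the cutting score
    return false_neg, false_pos, false_neg + false_pos


def optimal_cut(a, b, p_master=0.5):
    """Cutting score (0 .. n+1) with the smallest total misclassification."""
    return min(range(len(a) + 2),
               key=lambda c: misclassification(c, a, b, p_master)[2])


if __name__ == "__main__":
    for name, model in (("Model I", MODEL_I), ("Model II", MODEL_II)):
        for cut in range(5):
            fn, fp, total = misclassification(cut, **model)
            print(name, cut, f"{fn:.6f} {fp:.6f} {total:.6f}")
        print(name, "optimal cutting score:", optimal_cut(**model))
```

With equal proportions of masters and nonmasters, both parameter sets lead to a cutting score of 2, in agreement with Table 4. The sketch weights the two kinds of error equally; unequal losses would require weighting the two sums before comparing cuts.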

Emrick (1971) developed a procedure related to the restricted form
of the Macready and Dayton model. He generated a function for
identifying optimal cutoff scores in terms of relative costs of incorrect
mastery/nonmastery decisions and the ratio of a to b errors. The
optimized formula is


where

k = percentage of items correct required for a mastery
decision;

L1 = loss incurred from a false positive;

L2 = loss incurred from a false negative.

This cutscore value is the same as that suggested by Macready and
Dayton under their restricted model when the same parameter estimates
are used. However, Emrick suggests a different approach for parameter
estimation. He constructs a fourfold table relating true mastery state
and observed item responses to a single item, with the cell entries
being the error probabilities a and b. Emrick then treats a and b as
response contingencies and computes a phi coefficient to indicate the
correlation between observed single item responses and true mastery
state:


He uses the average interitem correlation of examinee responses to
compute an unbiased estimate of the reliability of a single item using the
Spearman-Brown prophecy formula.

Since reliability is defined as the proportion of total variance
that is true variance, it can be interpreted as an unbiased estimate of
the squared correlation between an examinee's true mastery state and his
or her item response. Hence, item responses, true mastery state, and
error probabilities can be directly related through the test reliability.
If the ratio of a to b is known (or if it can be estimated), values for
a and b can be directly calculated.


Table 4

Probability of Misclassification as a Function of Cutting
Score Under the Macready and Dayton Models,
Assuming P(M) = P(M̄) = .5

               P(False negative)      P(False positive)      P(Misclassification)
Cutting
score          Model I    Model II    Model I    Model II    Model I    Model II

0 (all pass)   0          0           .5         .5          .5         .5
1              .0005      .000864     .076775    .084708     .077275    .085572
2              .01650(a)  .019872(b)  .0032(c)   .005184(d)  .0197      .025056
3              .1580      .159264     .000025    .000108     .158025    .159372
4 (all fail)   .5         .5          0          0           .5         .5

(a, b) The probability that a master will be misclassified when the
cutoff score is set at 2 correct equals the sum of the probabilities
that a master will get only 0 or 1 items correct times the proportion
of masters in the group. For Model I, this probability equals
.0005 + .0095 + .0045 + .002 = .0165. For Model II,
.000864 + 3(.006336) = .019872.

(c, d) The probability that a nonmaster will be misclassified when the
cutoff score is set at 2 correct equals the sum of the probabilities
that a nonmaster will get 2 or 3 items correct times the proportion of
nonmasters in the group. For Model I, this probability equals
.000025 + .002475 + .000475 + .000225 = .0032. For Model II,
.000108 + 3(.001692) = .005184.
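The Model II columns of Table 4 can be checked with a short Python
sketch. It assumes a reading of the restricted model that is consistent
with the table footnotes but is not spelled out in this section: a
nonmaster answers each of three items correctly with probability
a = .06, a master answers correctly with probability 1 - b = .88, and
an examinee passes when the number correct is at least the cutting
score. The function names are illustrative only.

```python
from math import comb


def binom_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def model_ii_errors(cut, n=3, a=0.06, b=0.12, p_master=0.5):
    """Return (false negative, false positive, total) for a cutting score.

    Assumed reading: masters answer correctly with probability 1 - b,
    nonmasters with probability a, and a score >= cut is a pass.
    """
    # Masters who score below the cut are false negatives.
    false_neg = p_master * sum(binom_pmf(x, n, 1 - b) for x in range(cut))
    # Nonmasters who score at or above the cut are false positives.
    false_pos = (1 - p_master) * sum(
        binom_pmf(x, n, a) for x in range(cut, n + 1)
    )
    return false_neg, false_pos, false_neg + false_pos


for cut in range(5):
    fn, fp, total = model_ii_errors(cut)
    print(f"cut={cut}  FN={fn:.6f}  FP={fp:.6f}  total={total:.6f}")
```

Under these assumptions the total error is smallest at a cutting score
of 2, which agrees with the Model II column of the table.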

For the Macready-Dayton model example values (a = .06, b = .12),
the value of phi is .821. Squaring this value and applying the
Spearman-Brown prophecy formula for a three-item test indicates that the
test reliability for this example would be .86. Assuming a loss ratio of
1 and equal proportions of masters and nonmasters, the value for k in
Emrick's optimization formula is .4339. This implies a cutting score of
1.3 on a three-item test, or rounding up to the next higher integer, 2.
Thus, the final result is the same as the result obtained with Macready
and Dayton.
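The figures in this example can be retraced with the sketch below. The
formulas in it are assumptions rather than quotations: phi is computed
from a fourfold table with equal proportions of masters and nonmasters,
and k is the likelihood-ratio threshold with the loss ratio taken as
L2/L1. Both choices reproduce the reported values (.821, .86, .4339),
but the function names and the orientation of the loss ratio are
illustrative.

```python
from math import ceil, log, sqrt


def phi_coefficient(a, b):
    """Item/mastery-state correlation, assuming P(M) = P(not M) = .5."""
    return (1 - a - b) / sqrt((1 + a - b) * (1 - a + b))


def spearman_brown(r_single, n_items):
    """Reliability of an n-item test from single-item reliability."""
    return n_items * r_single / (1 + (n_items - 1) * r_single)


def emrick_k(a, b, n_items, loss_ratio=1.0):
    """Proportion correct required for mastery (loss_ratio = L2 / L1)."""
    numerator = log(b / (1 - a)) + log(loss_ratio) / n_items
    denominator = log(a * b / ((1 - a) * (1 - b)))
    return numerator / denominator


a, b, n_items = 0.06, 0.12, 3
phi = phi_coefficient(a, b)
reliability = spearman_brown(phi**2, n_items)
k = emrick_k(a, b, n_items)
cut = ceil(k * n_items)  # round up to the next whole item
print(f"phi={phi:.3f}  reliability={reliability:.2f}  k={k:.4f}  cut={cut}")
```

With equal losses the loss term vanishes, so the cutoff depends only on
the two error probabilities.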

Evaluation.  An  important  constraint  of  this  approach  is  that  the 
proportion  of  masters  and  nonmasters  must  be  equal.  (The  computations 
for  the  preceding  example  and  a more  general  form  of  the  Emrick  model 
are  presented  in  Appendix  A.) 

Other possible weaknesses in Emrick's approach to parameter
estimation are the subjectivity required and the somewhat overly
restrictive assumptions necessary to implement his approach. In addition,
the complexity of both conceptualizing and quantifying L1 and L2 may
greatly complicate the derivation of cutoff scores under these models.

If the assumptions are met, an optimal differentiation between
masters and nonmasters will result. Furthermore, a means is provided
to determine how many items are needed to keep the probability of
misclassification at or below some specified critical level. The
relationships among test items may also be explored. A major potential
weakness concerns the assumption that learning occurs in an "all-or-none"
manner, with no partial learning or overlearning. Failure to satisfy this
assumption could produce a poor fit of data to the model, which will in
turn produce a far less than optimal cutting score.

Binomial  Model 

Assumptions and Rationale. In contrast to the all-or-none learning
assumption of the Emrick and Macready models is the assumption that
learning is a continuous process. A binomial distribution model, first
suggested and derived by Kriewall (1969) and subsequently developed by
Millman (1972), defines proficiency as the probability that a person
will correctly respond to any test item randomly chosen from a specified
domain of items. Proficiency may also be defined as the proportion of
items that would be correct if all items in the domain could be
administered. Since the proficiency value can take on values from
zero to one, the model allows for partial acquisition.

The following assumptions are pertinent: (1) dichotomously scorable
items, (2) local independence of items, (3) no systematic learning or
forgetting during test taking, and (4) items equally difficult for any
given examinee. The percentage of items answered correctly is taken as
a point estimate of the examinee's true proficiency. For a given
proficiency, the probability of observing any score may be determined.
The hypothesis to be tested in this model involves the likelihood of a
specific score, if indeed the examinee had the given level of proficiency.

The basic equation for the binomial model yields the probability
distribution of scores for an examinee with proficiency "p" for repeated
random samples of items of size "n" from a given domain of items:

    f(x) = \binom{n}{x} p^{x} (1 - p)^{n - x},        (5)

where

x = the total number of correct responses,
f(x) = the probability of test score x, and
\binom{n}{x} = the binomial coefficient, n! / [x! (n - x)!].


The  binomial  model  can  be  used  to  provide  two  types  of  information. 
First,  the  proportion  correct  is  the  maximum  likelihood  estimate  of  an 
individual’s  proficiency  relative  to  the  particular  domain.  Second,  the 
model  can  be  used  to  investigate  the  interaction  between  test  length  and 
classification  error  when  individuals  are  divided  into  two  groups.  One 
group  will  contain  students  with  proficiency  greater  than  or  equal  to 
some  minimal  proficiency  criterion.  The  other  group  will  have  students 
with  proficiency  levels  less  than  or  equal  to  some  maximum  nonmastery 
criterion. 


To calculate the expected error in decisionmaking, it is necessary
to specify two parameters. The first is the lowest proficiency level
required for an individual to be considered a master. The second is
the highest proficiency level that a student could obtain and still be
considered a nonmaster. When these values are set by the decisionmaker,
the probability of false negative and false positive errors for minimal
masters and maximal nonmasters, respectively, can be calculated for any
given test length and cutting score. This procedure, it should be noted,
is generally conservative. That is, if the group contains examinees
with abilities above minimal mastery or below maximal nonmastery, the
number of misclassifications observed will be less than that predicted
by the model.

Example. Suppose that a cutoff score of 80% correct was selected
(i.e., in order to be classified as a master, a student must get correct
at least 80% of whatever number of items are included on the test).
Assume also that a true proficiency of 90% is defined as the minimal
mastery level, and that a true proficiency of 70% is defined as the
maximal nonmastery level. The region between these cutoff scores is
an "area of indifference." That is, if an examinee's true proficiency
lies between 70% and 90%, the decisionmaker would be indifferent as to
whether the examinee is classified as a master or as a nonmaster.

Values for misclassification error that can be tolerated must also
be specified. Continuing with the above example, assume that the
decisionmaker is unwilling to accept more than 26% of the students whose
true ability is 70%, and he or she wants to reject not more than 19% of
those whose true ability is 90%. Thus, the probabilities of a false
positive and false negative are .26 and .19, respectively. Given these
values, it is possible to determine the minimal number of test items.

The following notation will be used:

n = the total number of test items,
c = the cutoff score (in this example c = .8n, or the next highest
integer value of .8n, since an 80% standard was chosen), and
x = the observed score.

The formula for cumulative terms of the binomial distribution, for any
score j, is

    F(x \le j) = \sum_{x=0}^{j} \binom{n}{x} p^{x} (1 - p)^{n - x}.        (6)


Specifying that the probability of falsely rejecting a master must
not exceed .19 means that the cumulative probability of a master
obtaining a score from 0 correct to c - 1 correct must not exceed .19.
This constraint may be expressed as the inequality

    F(x \le c - 1) \le .19.        (7)

Therefore,

    .19 \ge \sum_{x=0}^{c-1} \binom{n}{x} (.9)^{x} (.1)^{n - x},

where p = .9, the minimal mastery level.

A similar relationship exists for nonmasters. Since the probability
of falsely accepting a nonmaster must not exceed .26, the cumulative
probability of a nonmaster obtaining a score greater than or equal to
c must not exceed .26. The inequality for nonmasters is

    F(x \ge c) \le .26.        (8)

Therefore,

    .26 \ge \sum_{x=c}^{n} \binom{n}{x} (.7)^{x} (.3)^{n - x},

where p = .7, the maximal nonmastery level.

Reference to a table of cumulative terms of the binomial distribution
shows that the minimum value of n for which these relationships
hold is 8.

Since .8(8) = 6.4, a cutoff score of 7 correct is chosen. Substituting
these values for c and n yields

    \sum_{x=0}^{6} \binom{8}{x} (.9)^{x} (.1)^{8 - x} \approx .19        (9)

    \sum_{x=7}^{8} \binom{8}{x} (.7)^{x} (.3)^{8 - x} \approx .26        (10)

These are the numerical solutions for the above inequalities.
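The table lookup described above can also be done by direct search. The
following Python sketch is one way to do it, assuming the cutoff is the
smallest whole number of items that is at least 80% of the test length;
the function names and the upper search limit are illustrative.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k + 1))


def cutoff(n, percent=80):
    """Smallest integer score that is at least `percent` percent of n."""
    return (percent * n + 99) // 100  # integer ceiling, no float rounding


def min_test_length(p_master=0.9, p_nonmaster=0.7,
                    max_false_neg=0.19, max_false_pos=0.26,
                    percent=80, n_max=100):
    """Return the first (n, c, false_neg, false_pos) meeting both limits."""
    for n in range(1, n_max + 1):
        c = cutoff(n, percent)
        false_neg = binom_cdf(c - 1, n, p_master)        # master scores < c
        false_pos = 1 - binom_cdf(c - 1, n, p_nonmaster)  # nonmaster >= c
        if false_neg <= max_false_neg and false_pos <= max_false_pos:
            return n, c, false_neg, false_pos
    return None


n, c, false_neg, false_pos = min_test_length()
print(f"n={n}  c={c}  false negative={false_neg:.3f}  "
      f"false positive={false_pos:.3f}")
```

For the values in the example, the search stops at the same test length
and cutoff as the tabled solution.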

The conservative nature of the model results from the fact that
the calculations are based on two point values of true proficiency,
70% and 90%. The previous calculations reflect the probabilities of
false positives and false negatives, assuming that the examinee group
is composed only of people with true proficiencies of 70% and 90%.
However, if an examinee had a true proficiency of 95%, the probability
that he or she would obtain a score of less than seven correct out of
eight items, and therefore be classified as a nonmaster, may be
expressed as

    \sum_{x=0}^{6} \binom{8}{x} (.95)^{x} (.05)^{8 - x} = .06.        (11)

This value is considerably less than the probability of a false negative
as previously obtained, .19.

On the other hand, if a person had a true proficiency equal to 60%,
the probability that he or she would obtain a score of seven or more
correct on an eight-item test, and therefore be classified as a master,
may be expressed as

    \sum_{x=7}^{8} \binom{8}{x} (.6)^{x} (.4)^{8 - x} = .11.        (12)

This value is much less than the probability of a false positive as
previously obtained, .26.
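The effect of true proficiency on the two error probabilities can be
tabulated with a few lines of Python. This is only a sketch for the
eight-item test with a cutoff of seven used in the example; the function
names are illustrative.

```python
from math import comb


def prob_fail(p, n=8, c=7):
    """P(score < c) for an examinee with true proficiency p."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(c))


def prob_pass(p, n=8, c=7):
    """P(score >= c) for an examinee with true proficiency p."""
    return 1 - prob_fail(p, n, c)


# Masters: the chance of a false negative shrinks as proficiency rises.
for p in (0.90, 0.95):
    print(f"p={p:.2f}  P(fail)={prob_fail(p):.3f}")

# Nonmasters: the chance of a false positive shrinks as proficiency falls.
for p in (0.70, 0.60):
    print(f"p={p:.2f}  P(pass)={prob_pass(p):.3f}")
```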

Millman  (1972)  has  prepared  tables  which  allow  the  decisionmaker 
to  reach  these  same  conclusions  without  calculations.  His  tables  also 
give  the  expected  misclassification  error  for  a variety  of  test  lengths, 
cutoff  percentages,  and  true  ability  levels. 

Evaluation. The binomial model actually describes the worst possible
situation. For most practical applications, the examinee population
will contain persons with true ability above the minimal mastery
level and below the maximal nonmastery level. To arrive at a more
realistic estimate of total misclassification, the equations would have
to be solved for each representative ability and be weighted by the
proportion of the group with each ability. Such a procedure is, of
course, feasible but its value is questionable. The values obtained
from the simple procedure are overly pessimistic; any decision derived
from empirical data could be no worse, and would probably be better.

A virtue of this model is that it is relatively straightforward,
being based on the familiar binomial distribution. It is one of the
simpler quantitative models to derive test lengths and cutting scores.
The model can be criticized, however, because of its conceptual
foundations. Specifically, the output of the model tells us the
probability that a student will attain a certain test score, given his
or her true ability level. However, it is by no means clear or obvious
that the decisionmaker would know the student's true level of
functioning. Indeed, if the true ability level were known, there would
be no need for models to determine test length and cutting scores. In
using the binomial model, the decisionmaker has to set estimated (or
desired) limits on the true level of functioning of the student. This
allows him or her to infer the conditional probability of the observed
test score, given the hypothesized level(s) of proficiency. This
binomial model is most useful for initial approximations of test length
and cutting score before test data have been collected.


Bayesian  Model 


Assumptions and Rationale. If information can be obtained about
the quality of the examinee population (perhaps on the basis of
previous similar populations) before the test scores are observed, a
Bayesian model may be appropriate for deriving test lengths and cutting
scores. The input consists of an estimate of the ability distribution
in the examinee population, and the conditional probabilities that a
randomly chosen item would be answered correctly given some ability
level. The output is the conditional probability that an individual's
ability equals (or, in some cases, exceeds) some criterion ability,
conditional upon his or her test score.

The Bayesian, like the binomial, model makes the following
assumptions: (1) items must be dichotomously scored, (2) responses are
independent, (3) items are equally difficult for any given examinee
within a particular ability group, and (4) there is no systematic
learning or fatigue during test taking. As in the binomial model,
ability is defined as the probability of responding correctly to a
randomly chosen item from the domain. We will continue to use the term
proficiency (p) when referring to this definition of ability.

Examples. The first model to be discussed assumes i ≥ 2 discrete
states of mastery.

Epstein and Steinheiser (1975) developed a two-step algorithm based
on work by Hershman (1971). The first step yields the probability of
an examinee being in mastery state i, conditional on an item score:

    p(M_i | t) = \frac{p(t | M_i) p(M_i)}
                      {\sum_{i=1}^{s} p(t | M_i) p(M_i)},        (13)

where s = the number of states,

t = the item score (0 or 1),

M_i = the mastery state being considered,

p(M_i) = the prior probability that an individual is in mastery
state i, and

p(t | M_i) = the probability of the score t, given the mastery state.


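The second step in the procedure combines the decisions for each
item into a final probability of being in mastery state i, given the
total test score:

    p(M_i | T) = \frac{\prod_{j=1}^{n} p(M_i | t_j)}
                      {p(M_i)^{n-1} \sum_{k=1}^{s} \left[ \prod_{j=1}^{n}
                       p(M_k | t_j) / p(M_k)^{n-1} \right]},        (14)

where

j = 1, 2, ... n indexes the items (n = the number of items) and
T = the total test score.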

For example, consider the case previously described for the binomial
model. Two mastery states are assumed, minimal mastery and maximal
nonmastery.

For the minimal mastery state (M1), p(t_j = correct (1) | M1) = .9 and
p(t_j = incorrect (0) | M1) = .1, for all j.

For the maximal nonmastery state (M2), p(t_j = correct (1) | M2) = .7
and p(t_j = incorrect (0) | M2) = .3.

Values must be given for the priors, p(M1) and p(M2). Their value
may be determined on the basis of past experience, or may simply
reflect the beliefs or expectations of the evaluator. Three cases will
be considered: p(M1) = p(M2) = .5; p(M1) = .12, p(M2) = .88; and
p(M1) = .62, p(M2) = .38. These correspond to little prior information,
relatively low expectations, and relatively high expectations.

The example was computed for an observed score of seven correct on an
eight-item test. The results are shown in Table 5.

For Cases 2 and 3, where prior information favored the nonmastery
and mastery states, respectively, the final decision can be made with a
relatively high degree of confidence. For the case of little prior
information, Case 1, the probabilities of misclassification are greater.
The effects of the priors on the final decision are also clear. For the
equal priors case, the weight of the observed evidence favors a mastery
decision. However, where the nonmastery state is favored in the prior
probabilities (Case 2), the evidence does not overcome the priors and
a nonmastery decision is made.
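The two-step procedure can be sketched in Python as follows. The second
step is written in the form suggested by the computational steps under
Table 5 (the product of the item-level posteriors, divided by the prior
raised to the power n - 1, then normalized over states); this is an
assumed reading, and the function names are illustrative. Because no
intermediate values are rounded, the results (about .66, .21, and .76
for the mastery state in the three cases) differ slightly from the
tabled entries, presumably because the hand computation rounded at each
step.

```python
def item_posteriors(priors, p_correct, t):
    """Step 1: p(M_i | t) for a single item score t (1 or 0)."""
    likelihoods = [pc if t == 1 else 1 - pc for pc in p_correct]
    joint = [lk * pr for lk, pr in zip(likelihoods, priors)]
    total = sum(joint)
    return [jt / total for jt in joint]


def total_score_posteriors(priors, p_correct, item_scores):
    """Step 2: combine item-level posteriors into p(M_i | T)."""
    n = len(item_scores)
    weights = []
    for i, prior in enumerate(priors):
        product = 1.0
        for t in item_scores:
            product *= item_posteriors(priors, p_correct, t)[i]
        # Remove the prior counted n - 1 times too often in the product.
        weights.append(product / prior ** (n - 1))
    total = sum(weights)
    return [w / total for w in weights]


p_correct = [0.9, 0.7]      # minimal mastery, maximal nonmastery
scores = [1] * 7 + [0]      # seven correct on an eight-item test
for priors in ([0.5, 0.5], [0.12, 0.88], [0.62, 0.38]):
    post = total_score_posteriors(priors, p_correct, scores)
    print(priors, [round(v, 3) for v in post])
```

The result is the same as applying Bayes' theorem once to the whole
response pattern, since the items are assumed independent.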

Table 5

Changes in Posterior Probability of Mastery as a Function
of Changes in Prior Probability of Mastery

Prior        p(M1)      .5      .12*     .62
             p(M2)      .5      .88      .38

Posterior    p(M1|T)    .66     .205*    .767
             p(M2|T)    .33     .796     .242

*Computational steps:
   p(t_j = 1) = .12 x .9 + .88 x .7 = .724
   p(t_j = 0) = .12 x .1 + .88 x .3 = .276
   p(M1|t_j = 1) = (.12 x .9)/.724 = .149
   p(M1|t_j = 0) = (.12 x .1)/.276 = .043
   Πp(M1|t_j) = (.149)^7 (.043) = 7 x 10^-8
   p(M2|t_j = 1) = (.88 x .7)/.724 = .851
   p(M2|t_j = 0) = (.88 x .3)/.276 = .957
   Πp(M2|t_j) = (.851)^7 (.956) = .309
   p(M1|T) = 7 x 10^-8/[(.12)^7 (7 x 10^-8/(.12)^7 + .309/(.88)^7)] = .205
   p(M2|T) = .309/[(.88)^7 (7/36 + .755)] = .796


Whereas the Epstein and Steinheiser technique seems to offer a
method for reducing the uncertainty in decisionmaking for a given
number of test items, their procedure is limited by the constraint that
only discrete mastery groups are considered. The second model to be
reviewed deals with continuous distributions of proficiency and
classifies examinees based upon the probability that their proficiency
equals or exceeds some minimal criterion. Novick and Lewis (1974)
achieve this by assuming that the distribution of examinee proficiencies
can be approximated by a member of the family of Beta distributions. The
probability of achieving any score of interest, given the proficiency,
remains binomial. The form of Bayes' Theorem is then a probability
density function of the form p(T|x) ∝ p(x|T)p(T), where T is the
proficiency and x is the test score.

If p(x|T) is binomial and p(T) is a Beta distribution, then p(T|x)
will also be a member of the Beta family. In fact, if the prior
distribution is Beta (a, b) (i.e., B(a, b)), and a score of x is
observed in n trials, then the posterior distribution is B(a + x,
b + n - x).

Continuing with the previous example in the continuous framework,
we shall now consider three prior distributions. Integer values of
a and b, the parameters of the Beta distribution, will be used. We
may therefore use the Incomplete Beta function I_p(a, b), which has the
following relationship to the cumulative binomial distribution:

    \sum_{x=x'}^{n} \binom{n}{x} p^{x} (1 - p)^{n - x}
        = I_p(x', n - x' + 1),        (15)

where n is the number of test trials, p is the probability of success
on a randomly selected trial, and x' is the observed number of successes.

Tabled values are available (Beyer, 1966, Table III.2). For noninteger
values of a and b, programed numerical methods may be required
(Novick & Jackson, 1974).


For the first example, assume that little is known about the examinee
population, i.e., a randomly selected examinee may get a test score
that would place him or her in the mastery or nonmastery category with
equal probability. In terms of the Beta distribution, this means that
examinee proficiency would be rectangularly distributed, resulting in
a = 1, b = 1, or B(1, 1) (Novick & Jackson, 1974, p. 114).

For the second case, assume that the prior probability that a randomly
chosen examinee has proficiency greater than or equal to .8 is
.12, i.e., P(p ≥ .8) = .12. Therefore, 1 - p is used to enter the
cumulative binomial table at the top (since tabled p values stop at
p = .50), and .12 is the table value.

However, we cannot use the table until one more parameter is
specified; so let us assume that the examiner's "certainty of prior
belief" can be quantified as being equivalent to the information that
would be available if a 10-item test were given (Winkler, 1972, p. 187).
With n = 10, we find that an entry with a value of .12 in the .20 column
for n = 10 has an associated x' value equal to 4. Unfortunately, x'
does not equal 4, due mainly to a limitation of the table, since p
values stop at .50 and do not extend to .80 or beyond. Note, however,
that if we let x' = 4 in the cumulative binomial, and subtract the
result from 1, we obtain

    1 - \sum_{x=4}^{10} \binom{10}{x} (.2)^{x} (.8)^{10 - x},

which equals 1 - .1208, or .88.

If the table had extended to p = .8, then the value .879 would have
been found as the entry corresponding to n = 10 and x' = 7. Hence, the
value for x' is 7. Substituting x' = 7 and n = 10 in equation (15), we
obtain I_p(7, 4) as the Beta distribution which represents the prior
information that P(p ≥ .8) = .12 is equivalent to 10 additional test
trials.

The third example considers that the prior probability of a randomly
chosen examinee having proficiency greater than or equal to .8
is .62, which is also comparable to information that could be obtained
from a 10-item test. Again, entering the table with n = 10, 1 - p = .2,
we find that a tabled value of .62 this time corresponds to x' = 2.
Substituting x' = 2 in the cumulative binomial and subtracting that
result from 1 yields .38. Again, an extension of the table to p = .8
would show that when n = 10, a tabled value of .38 corresponds to an
x' value of 9. Therefore, the parameters for the Beta distribution in
this case are I_p(9, 2).
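The table lookups for the second and third priors can be replaced by a
direct search over x'. The sketch below assumes the relationship in
equation (15), so that the prior probability of proficiency at or above
the criterion is one minus the cumulative binomial term, and it picks
the x' whose value is closest to the stated belief. The function names
are illustrative.

```python
from math import comb


def upper_binomial(x_prime, n, p):
    """P(X >= x') for X ~ Binomial(n, p); equals I_p(x', n - x' + 1)."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x)
               for x in range(x_prime, n + 1))


def beta_prior_from_belief(target, criterion=0.8, n=10):
    """Return integer Beta parameters (a, b) = (x', n - x' + 1).

    `target` is the prior probability that proficiency is at least
    `criterion`; n is the test length judged equivalent to the prior.
    """
    def gap(x_prime):
        prior_above = 1 - upper_binomial(x_prime, n, criterion)
        return abs(prior_above - target)

    best = min(range(1, n + 1), key=gap)
    return best, n - best + 1


print(beta_prior_from_belief(0.12))  # second example
print(beta_prior_from_belief(0.62))  # third example
```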

Having  thus  derived  the  prior  distributions,  let  us  now  consider 
some  hypothetical  test  scores,  and  then  derive  the  posterior  distributions. 

Suppose that a score of seven correct on an eight-item test were
observed. Then the posterior proficiency distributions will be B(a +
number correct, b + number of trials - number correct). For the three
examples, we therefore have B(8, 2), B(14, 5), and B(16, 3).

The posterior probability that an examinee with a score of seven
correct out of eight items has a proficiency greater than or equal to
.8 (i.e., P(p ≥ .8 | 7, 8)) can be found by determining the area in the
upper tail of the appropriate Incomplete Beta function (Winkler, 1972,
Table 5; Schlaifer, 1969, Table T3; Novick & Jackson, 1974, Table A-14).
For the three examples, these upper-tail values are:
1 - I_.8(8, 2) = .56; 1 - I_.8(14, 5) = .28; and 1 - I_.8(16, 3) = .73.

Since the origin of these values may not be intuitively obvious,
we shall outline the steps required to complete the first example, using
the Novick and Jackson tables.

Step  1:  Since  p > q,  reverse  the  order,  and  enter  the  table  with 
p = 2 and  q = 8. 

Step  2:  The  table  gives  the  cumulative  area  (of  proficiency); 
however,  since  we  want  to  determine  the  area  in  the  upper  part  of  the 
Beta  function,  we  need  to  subtract  the  stated  proficiency  of  .8  from 
1,  and  thereby  obtain  .2.  This  represents  the  symmetric  area  in  the 
lower  20%  of  the  distribution. 

Step 3: .2 lies between the tabled values of .1796 and .2723,
with associated probabilities (fractiles) of those tabled proficiencies
equal to 50% and 75%, respectively.

Step 4: Interpolation yields the fact that a 20% or less proficiency
would occur 56% of the time; therefore, 80% or greater proficiency
should also be observed 56% of the time.

Novick  and  Jackson  also  provide  a convenient  set  of  charts  (pp. 
122-123)  for  rapid  approximations,  although  it  should  be  noted  that  for 
the  current  example,  the  solution  is  found  to  be  .44  from  their  chart 
A.  This  value  must  be  subtracted  from  1,  since  the  .44  represents  the 
cumulative  area  in  the  lower  portion  of  the  B(8,  2)  curve. 


If the probability of having a proficiency greater than or equal
to .8 must be at least .5 for an examinee to be classified as a master,
then a score of 7 out of 8 would lead to a mastery classification only
in the first and third examples previously described. The weight of
the low prior reversed the decision rule in the second example.
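For integer parameters, the three posterior probabilities and the
resulting classifications can be computed without tables. The sketch
below uses the cumulative binomial form of the Incomplete Beta function
from equation (15); the .5 decision threshold is the one stated above,
and the function names are illustrative.

```python
from math import comb


def incomplete_beta_int(p, a, b):
    """I_p(a, b) for positive integers a and b, via the cumulative binomial."""
    n = a + b - 1
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x)
               for x in range(a, n + 1))


def posterior_mastery_prob(prior, correct, items, criterion=0.8):
    """Update a Beta(a, b) prior and return the posterior parameters
    together with P(proficiency >= criterion | score)."""
    a, b = prior
    a_post, b_post = a + correct, b + items - correct
    upper_tail = 1 - incomplete_beta_int(criterion, a_post, b_post)
    return (a_post, b_post), upper_tail


for prior in ((1, 1), (7, 4), (9, 2)):
    post, prob = posterior_mastery_prob(prior, correct=7, items=8)
    decision = "master" if prob >= 0.5 else "nonmaster"
    print(f"prior B{prior} -> posterior B{post}  "
          f"P(p >= .8) = {prob:.2f}  {decision}")
```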


For another approach to deriving prior distributions, assume that
prior information can be described as equivalent to 7 correct on a
10-item test. (This is an assumption not without criticism, as we shall
note in a subsequent section.) Assume also that proficiency is
distributed as Beta, a helpful and reasonably appropriate assumption.
The mean of the examinees' proficiency then equals x/(n + 1), or
7/11 = .636. The variance equals
x(n - x + 1)/[(n + 1)^2 (n + 2)] = 28/1452 = .019.

Since the parameters are integers, we may once again use the cumulative
binomial as a means of obtaining the Incomplete Beta function:

    I_p(7, 4) = \sum_{x=7}^{10} \binom{10}{x} p^{x} (1 - p)^{10 - x}.        (16)

Equation (16) is the probability that a given proficiency is less
than or equal to p. We can compute this probability by assigning
specific values to p, as shown in Table 6. The values for
P(proficiency ≤ p) up to the 50th fractile may be found directly
(Beyer, 1966, Table III.2) for x' = 7 and n = 10. Values for .6 and
greater can be computed according to the cumulative binomial
equation (16). When the values obtained (as in Table 6) are plotted,
the result is a smooth ogive-like curve (Winkler, 1972, pp. 153, 186;
Schlaifer, 1969, p. 438).

Table 6

Cumulative Estimation of Prior Probabilities for
Various Assumed Proficiencies

p = Proficiency      I_p(7, 4), or P(proficiency ≤ p)

.1                   .0000
.2                   .0009
.3                   .0106
.4                   .0548
.5                   .1719
.6                   .3823
.7                   .6496
.8                   .8791
.9                   .9872


To plot the proficiency distribution, we may use the Beta
distribution function:

    f(p) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}
           p^{a - 1} (1 - p)^{b - 1}.        (17)


Values of the proficiency (p) may be chosen, but a = x' = 7, and
b = n - x' + 1 = 4. Since Γ(n + 1) = n! for integers, we can easily
solve equation (17): Γ(a + b) = Γ(11) = 10! = 3.6288 x 10^6; Γ(a) =
Γ(7) = 6! = 7.2 x 10^2; Γ(b) = Γ(4) = 3! = 6. Therefore,
Γ(a + b)/[Γ(a)Γ(b)] = 3.6288 x 10^6/[(720)(6)] = 840. Table 7 shows how
values of f(p) may be obtained.
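Both the cumulative values of Table 6 and the point values of Table 7
can be generated with a short Python sketch. It assumes the B(7, 4)
prior derived above and uses only the standard library; the function
names are illustrative.

```python
from math import comb, gamma


def incomplete_beta_int(p, a, b):
    """I_p(a, b) for positive integers a and b, via the cumulative binomial."""
    n = a + b - 1
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x)
               for x in range(a, n + 1))


def beta_density(p, a, b):
    """Beta(a, b) density at p, using the gamma-function constant."""
    const = gamma(a + b) / (gamma(a) * gamma(b))  # 840 for a = 7, b = 4
    return const * p ** (a - 1) * (1 - p) ** (b - 1)


a, b = 7, 4
for tenth in range(1, 10):
    p = tenth / 10
    print(f"p={p:.1f}  I_p(7,4)={incomplete_beta_int(p, a, b):.4f}  "
          f"f(p)={beta_density(p, a, b):.3e}")
```

Plotting the first column against p gives the ogive-like cumulative
curve; plotting the second gives the prior density itself.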

A plot  of  the  tabled  values  for  p on  the  abscissa  and  f (p)  on  the 
ordinate  could  then  be  made.  Such  plots  may  also  be  found  in  Winkler 
(1972,  sec.  4.3  and  4.4),  Schlaifer  (1969,  sec.  11.1.2)  and  Novick  and 
Jackson  (1974,  p.  112) . Note  that  this  is  a prior  distribution  of 
hypothesized  proficiencies  in  which  we  assumed  at  the  outset  that  the 
information  could  be  characterized  as  comparable  to  the  information 
that  would  be  obtained  from  observing  a score  of  seven  correct  on  a ten- 
item  test. 

Evaluation. Bayesian models offer the possibility of enhancing
the assessment of examinee proficiency by using prior information, e.g.,
knowledge that content experts or examiners have about previous similar
examinee populations. As the validity and accuracy of this prior
information increases, fewer test items will be needed to achieve a given
level of classification accuracy in comparison to the binomial model
and in comparison to the Bayesian case of equal priors. As more is
known about the examinee population (i.e., the more that prior
information departs from a B(1, 1) distribution), the more the
variability in the posterior distribution is reduced, and the more the
number of items to attain a desired level of accuracy is reduced.

In  comparing  the  binomial  and  Bayesian  models,  note  that  the  former 
produced  as  output  the  probability  of  observing  a specific  score  condi- 
tional upon  some  hypothesized  true  ability  level.  In  the  spirit  of 
classical  hypothesis  testing,  one  need  not  know  anything  about  an  exami- 
nee's proficiency,  except  that  he  or  she  is  more  or  less  likely  to  come 
from  the  mastery  side  of  the  cutoff  score.  Since  some  true  level  of 
functioning  must  be  hypothesized,  it  is  possible  to  determine  the  prob- 
abilities of  falsely  passing  a nonmaster  and  falsely  failing  a master 
if  the  test  score  suggests  a true  proficiency  level  either  above  or 
below  the  hypothesized  true  level  of  functioning. 

In  contrast,  the  Bayesian  model  provides  as  output  the  probability 
that  a specific  examinee  has  a true  ability  equal  to  or  greater  than  the 
criterion  (minimal)  ability,  conditional  upon  the  observed  test  score. 

But  since  no  true  ability  was  hypothesized,  false  positive  and  false 
negative  error  rates  cannot  be  specified  as  was  possible  with  the  bino- 
mial model.  While  both  models  give  the  probability  that  an  examinee  is 
a member  of  some  ability  level  group,  the  binomial  estimate  refers  to 
the  probability  of  a score  occurring  conditional  upon  the  assumed  true 
proficiency;  whereas  the  Bayesian  estimate  refers  to  the  probability  of 
a specific  examinee  being  at  or  beyond  some  proficiency  level  conditional 
upon  his  or  her  observed  test  score. 
Table 7

Point Values for Prior Proficiency Distribution

Proficiency                                    f(p) =
values (p)     p^(a-1) (1-p)^(b-1)      840 p^(a-1) (1-p)^(b-1)

   .1            7.29 x 10^-7               6.12 x 10^-4
   .2            3.28 x 10^-5               2.75 x 10^-2
   .3            2.50 x 10^-4               2.10 x 10^-1
   .4            8.85 x 10^-4               7.44 x 10^-1
   .5            1.95 x 10^-3               1.64 x 10^0
   .6            2.99 x 10^-3               2.51 x 10^0
   .7            3.18 x 10^-3               2.67 x 10^0
   .8            2.10 x 10^-3               1.76 x 10^0
   .9            5.31 x 10^-4               4.48 x 10^-1
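
The point values in Table 7 can be recomputed with a short sketch. The parameter values a = 7 and b = 4 are an inference from the tabled numbers and from the constant 840 (which equals 1/B(7, 4)); they are not quoted from this section and should be read as an assumption.

```python
from math import factorial


def beta_constant(a, b):
    """Normalizing constant 1 / B(a, b) for positive integer a and b."""
    return factorial(a + b - 1) // (factorial(a - 1) * factorial(b - 1))


def beta_density(p, a, b):
    """Beta(a, b) density: f(p) = p^(a-1) (1-p)^(b-1) / B(a, b)."""
    return beta_constant(a, b) * p ** (a - 1) * (1 - p) ** (b - 1)


a, b = 7, 4  # assumed: inferred from the table, not stated in this section
for k in range(1, 10):
    p = k / 10
    kernel = p ** (a - 1) * (1 - p) ** (b - 1)
    print(f"{p:.1f}   {kernel:.2e}   {beta_density(p, a, b):.2e}")
```

Under this assumption the recomputed values agree with the table to the printed precision, except that the final density at p = .9 comes out near 4.46 x 10^-1 rather than the tabled 4.48 x 10^-1.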

There are several difficulties confronting the potential user of a
Bayesian model for CRT purposes. First, the mathematics can become
rather cumbersome, since the Beta distribution must be used when ability
is assumed to be distributed continuously. Second, a methodological
difficulty arises in the determination of prior probabilities (Winkler,
1972, sec. 4.8). It is methodologically unsound to merely ask the
examiner or expert to "state his priors," since simple human judgment of
probabilities is often unreliable, inconsistent, and distorted (Kaplan
& Schwartz, 1975). A method used in the present paper, equating prior
information to comparable test length and score information, may be
suitable for purposes of illustration, but it may be difficult to
implement in applied settings.

There is at present a dearth of research about how prior probabilities
can actually be obtained from experts. Perhaps a pair comparison
or forced-choice procedure could be used in which various combinations
of proficiency (or expected scores) and associated probabilities are
presented to the expert (Steinheiser, 1976). Thus, the judge's prior
distribution would be directly obtained, and the best fitting Beta
distribution used to provide the necessary parameter values.


Rasch's One-Parameter Logistic Model

Assumptions and Rationale. The latent trait model developed by
Rasch (1960, 1961, 1966) is claimed to yield person-free test
calibrations and item-free person measurements (Wright & Panchapakesan, 1969).
The model attempts to reproduce an item by score group matrix in which
n items are ordered by their difficulties, and n - 1 score groups are
ordered by the raw scores. Cell entries represent the probability that
item i will be passed by a person in score group j (Whitely & Dawis,
1974).

There are two parameters in the model. The first is person ability
A; the second is item difficulty D. The odds (O) of a person correctly
answering an item are equal to the product of the person's ability times
the item's difficulty: O = A x D. If we express the odds as a
probability, we find that the probability P of a person with ability A
succeeding on an item with difficulty D can be expressed as

$$P = \frac{A \times D}{1 + A \times D}.$$

Replacing A and D with their logarithms, log A = a and log D = d, we
may finally express P as a logistic function (Wright, 1967):

$$P = \frac{1}{1 + e^{(-a - d)}} \qquad (18)$$
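
As a minimal sketch, the odds form and the logistic form of Equation 18 can be checked against each other numerically. The ability and item values below are hypothetical, and the function names are introduced here only for illustration.

```python
from math import exp, log


def rasch_probability(a, d):
    """Equation 18: P of success for log ability a and log item parameter d."""
    return 1.0 / (1.0 + exp(-a - d))


def probability_from_odds(A, D):
    """Same probability obtained from the odds O = A x D."""
    odds = A * D
    return odds / (1.0 + odds)


A, D = 2.0, 1.5  # hypothetical person and item parameters
p_logistic = rasch_probability(log(A), log(D))
p_odds = probability_from_odds(A, D)
print(p_logistic, p_odds)  # the two forms agree up to rounding error
```

Note that in this parameterization a larger d raises the probability of success, so d behaves as an easiness on the log scale; this matches the "item easiness" wording used in the studies reviewed later in the section.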


This model assumes that (1) all items measure the same unidimensional
trait; (2) all items have equal discriminating power and vary
only in difficulty (the restriction of a common discrimination index
results in a set of nonintersecting item characteristic curves which
differ only by a translation along the ability scale); (3) subjects
and items are locally independent; (4) guessing effects are negligible;
and (5) there is no time constraint on answering items (Rasch, 1966).

Tests comprised of items all of which fit the model have the
following properties (Wright & Panchapakesan, 1969; Whitely & Dawis, 1974):
(1) estimates of item difficulty parameters will not differ
significantly for any sample of examinees; (2) estimates of person ability
will not differ significantly for any sample of calibrated items; (3)
individual ability estimates can be measured on at least an interval,
and perhaps a ratio scale (Wright, 1967); (4) the scale of abilities
is defined regardless of the characteristics of the subject population
who take the test; and (5) a unique standard error of measurement is
associated with each ability level.

The  significance  of  the  Rasch  logistic  model  may  be  appreciated 
by  comparing  it  to  "classical"  models  of  test  development: 

A psychological test having these general characteristics
would become directly analogous to a yardstick that measures
the length of objects. That is, the intervals on the yardstick
are independent of the length of the objects, and the
length of individual objects is interpretable without respect
to which particular yardstick is used. In contrast,
tests developed according to the classical model have neither
characteristic. The score obtained by a person is not
interpretable without referring to both some norm group and the
particular test form used. ... No longer would equivalent
forms need to be carefully developed, since measurement is
instrument independent and any two subsets of the calibrated
item pool could be used as alternative instruments. Similarly,
independence of measurement from a particular population
distribution implies that tests can be used for persons
dissimilar from the standardization population without the
necessity of collecting new norms (Whitely & Dawis, 1974,
pp. 163-164).

Examples. Calibrating a test using the Rasch model results in a
logarithmic ability estimate being assigned to every possible raw score.
This estimate indicates the amount of ability required to achieve that
raw score. A comparison of the ability estimates assigned to a given
raw score by two samples with different ability distributions indicates
the degree to which the Rasch model calibrates a test independently of
the ability level of the calibration sample.

Wright (1967) studied the responses of 976 beginning law students
to 48 reading comprehension items on the L.S.A.T. To obtain samples
with different ability distributions, he selected two contrasting
groups from his total sample. The lower group included the 325 students
who did poorest on the test, with a top score of 23. The higher group
included the 303 students with the highest scores, with a bottom score
of 33. Wright compared the similarity between the two sets of Rasch
ability estimates and the two sets of percentile ranks. Figure 1 shows
the results, in terms of "person-bound test calibration," where a plot
of raw score against percentile rank clearly shows two different ability
groups. If a person is said to be in the nth percentile, reference must
be made to which group that person belongs.

After subjecting these same data to the Rasch logistic analysis,
the test scores are transformed into ability measurements along the
ordinate. Figure 2 shows that the curves for the best and worst
examinees almost completely overlap.

The  difficulty  estimates  based  upon  these  dichotomous  examinee 
groups  are  statistically  equivalent.  Therefore,  these  estimates  are 
independent  of  the  ability  of  the  examinees  in  the  calibration  sample, 
and  may  be  used  over  the  entire  range  of  ability.  Comparing  the  cali- 
bration curves  of  these  figures  shows  the  contrast  between  (1)  calibra- 
tion based  upon  the  ability  distribution  of  a standardizing  sample,  and 
(2)  calibration  that  is  free  from  the  effects  of  the  ability  distribu- 
tion of  the  examinees  used  for  the  calibration. 

Can  ability  be  measured  in  a fashion  that  frees  it  from  dependence 
on  the  use  of  a fixed  set  of  items?  If  a pool  of  test  items  has  been 
calibrated  on  a common  scale,  can  any  set  of  items  be  selected  from  that 
pool  to  make  statistically  equivalent  ability  measurements? 

Wright  (1967)  tested  these  hypotheses  by  making  it  as  difficult  as 
possible  for  person  measurement  to  be  item  free.  He  divided  the  origi- 
nal test  items  into  two  non-overlapping  subtests,  the  easiest  items 
comprising  one  subtest  and  the  hardest  items  comprising  the  other  sub- 
test. The  model  predicts  that  ability  estimates  based  upon  the  easy 
subtest  should  be  statistically  equivalent  to  those  estimates  based 
upon  the  hard  subtest. 

The solution required converting the scores to log abilities, and
then standardizing the differences in ability estimates. First, for
each score, the corresponding log ability on the calibration curves was
obtained (see Figure 2). For each pair of scores (from the easy and
hard subtests), a pair of estimated log abilities was obtained. Then,
a standardized difference was found by dividing the difference between
the easy and hard subtest ability estimates by the measurement error
of the differences. If the ability estimates are statistically
equivalent, then the distribution of standardized differences should have a
mean equal to zero and a standard deviation equal to one. The obtained
values were .003 and 1.014, respectively.
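
The standardization step can be sketched as follows. The log abilities and standard errors below are hypothetical, not Wright's data, and the sketch assumes that the two subtest errors are independent, so that the measurement error of a difference is the square root of the sum of the squared standard errors.

```python
from math import sqrt
from statistics import mean, stdev


def standardized_difference(b_easy, se_easy, b_hard, se_hard):
    """Difference between two log ability estimates in standard error units.

    Assumes independent errors on the easy and hard subtests.
    """
    return (b_easy - b_hard) / sqrt(se_easy**2 + se_hard**2)


# Hypothetical examinees: ((easy estimate, easy SE), (hard estimate, hard SE)).
examinees = [
    ((0.90, 0.45), (1.30, 0.50)),
    ((1.60, 0.50), (1.10, 0.48)),
    ((-0.20, 0.42), (0.10, 0.55)),
    ((0.40, 0.43), (-0.30, 0.60)),
]

z_values = [
    standardized_difference(b_e, se_e, b_h, se_h)
    for (b_e, se_e), (b_h, se_h) in examinees
]
print("mean:", mean(z_values), "standard deviation:", stdev(z_values))
```

With a realistic number of examinees, a mean near zero and a standard deviation near one would be the pattern the model predicts.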
(adapted from Wright, 1967).


Applications. A more detailed example will show how the Rasch
model was used to analyze the results of a criterion-referenced test
(Kifer & Bramble, 1974). The data were obtained from 201 college
students taking an 84-item multiple choice examination in introductory
educational psychology. After discarding items that did not fit the
model, the final test contained 68 items.

Comparison  of  the  Rasch-derived  ability  estimates  to  a criterion 
score  can  proceed  in  two  ways. 

The  first  is  analogous  to  determining  the  probability  of  committing 
a Type  I error  in  classical  hypothesis  testing.  That  is,  if  the  cri- 
terion ability  corresponds  to  the  null  hypothesis,  we  must  determine  the 
probability  that  an  obtained  ability  could  have  arisen  from  random  sam- 
pling from  a distribution  with  a mean  equal  to  the  criterion  ability  and 
a standard  deviation  equal  to  the  error  associated  with  the  criterion 
ability. 

The second is analogous to determining the probability of committing
a Type II error in classical hypothesis testing. That is, given
an obtained ability estimate and associated error (standard deviation),
we seek the probability that the criterion ability could have been
observed from random sampling from the distribution corresponding to the
obtained ability estimate.

Kifer  and  Bramble  chose  to  define  their  criterion  score  as  80%  of 
the  items  correct  or  54.4  items  correct.  Their  cutoff  score  was  there- 
fore 55.  A raw  score  of  55  yields  an  ability  estimate  of  1.69,  with  a 
standard  error  of  .33.  Suppose  a raw  score  of  60  were  obtained.  What 
is  the  probability  that  this  score  exceeds  the  criterion  score  of  55? 

The solution requires that we find the probability that this score
is part of the criterion distribution, with mean equal to 1.69 and
standard deviation equal to .33. (1) Kifer and Bramble's parameter estimates
show that an observed score of 60 has an ability value equal to 2.32.
(2) 2.32 - 1.69 = .63 units of difference between the observed and
criterion abilities. (3) .63/.33 = 1.91 standard deviations of difference
between the ability values. (4) A table of the normal distribution
shows that 1 - F(1.91) = .03. Therefore, the ability value of 2.32
has a probability = .03 of coming from a normal distribution with a
mean = 1.69 and standard deviation = .33.

There is a second method by which ability estimates may be
compared to mastery standards. This method requires the probability that
the criterion ability is part of the distribution which has a given
(observed) ability as its mean and the given ability standard error
as its standard deviation. We now need to find the probability that
the true ability corresponding to a score of 60 does not exceed the
criterion ability. (1) Kifer and Bramble's parameter estimates show
that an observed score of 60 has an ability value equal to 2.32 and a
standard error equal to .39. (2) 2.32 - 1.69 = .63 units of difference
between the observed and criterion abilities. (3) .63/.39 = 1.62
standard deviations of difference between the abilities. (4) A table of the
normal distribution shows that 1 - F(1.62) = .05. Therefore, the
ability value of 1.69 has a probability of .05 of coming from a normal
distribution with mean = 2.32 and standard deviation = .39. Therefore,
the probability that an examinee with a score of 60 has a true ability
below the criterion value = .05, which is the Type II error analog that
the criterion score would not be obtained by chance given the obtained
ability.
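
Both comparisons can be reproduced with a short sketch that uses the ability estimates and standard errors quoted above. The normal tail is computed from the complementary error function instead of a printed table; the variable names are introduced here for illustration only.

```python
from math import erfc, sqrt


def upper_tail(z):
    """1 - F(z) for the standard normal distribution."""
    return 0.5 * erfc(z / sqrt(2.0))


criterion_ability, criterion_se = 1.69, 0.33  # raw score of 55
observed_ability, observed_se = 2.32, 0.39    # raw score of 60

difference = observed_ability - criterion_ability

# First method: observed ability referred to the criterion distribution.
z_first = difference / criterion_se
print(round(z_first, 2), round(upper_tail(z_first), 2))    # 1.91 0.03

# Second method: criterion ability referred to the observed distribution.
z_second = difference / observed_se
print(round(z_second, 2), round(upper_tail(z_second), 2))  # 1.62 0.05
```

The two methods differ only in which standard error scales the same .63 difference, which is why the second yields the larger tail probability here.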

Anderson  et  al.  (1968)  investigated  the  hypothesis  that  Rasch  item 
easiness  estimates  are  independent  of  the  ability  of  the  calibrating 
sample,  and  that  the  item  easiness  estimates  are  more  stable  when  only 
items  that  fit  the  model  are  considered.  They  used  the  45-item  spiral 
omnibus  intelligence  test  for  screening  applicants  to  the  Australian 
Army  or  Royal  Australian  Navy.  Samples  of  608  recruit  applicants  to 
the  Citizen  Military  Force  (CMF)  and  874  recruit  applicants  to  the  Royal 
Australian  Navy  were  studied.  Twelve  items  were  deleted  for  zero  or 
for  100%  correct  responses. 

For the CMF sample, 30 items (91%) fit the model at the .01
confidence level, and 25 items (76%) fit the model at the more stringent .05
level of confidence. (The level of confidence represents the probability
of obtaining the observed pattern of responses, assuming that the model
is adequate to explain performance on the item.) For the Navy sample,
the corresponding findings were 22 items (67%) and 16 items (48%).

The correlation between the item easiness estimates from both
samples was .958 (based upon 33 items). When the items that failed to fit
the model at the .05 level were deleted, the correlation increased to
.990. It therefore appears that the item easiness ratios were
independent of the ability of the samples from which they were computed. It
should be critically noted that an intelligence test was used, and that
the two subject populations probably did not differ significantly.

In a more recent study, Tinsley and Dawis (1975) gave four types
of tests (verbal, numerical, picture, and item-symbol analogies) to four
groups of subjects: college students, high school students, civil
service clerks, and clients of the state Division of Vocational
Rehabilitation (DVR). If Wright's (1967) findings could be replicated, then the
ability estimates of one group should correlate highly with the ability
estimates of another group for the same test. Of the 10 correlations
that were computed (e.g., college students and high school students for
the picture test, high school students and DVR clients on verbal
analogies), all reached +.999. The invariant relationship between the ability
estimates calculated for a 25-item verbal analogies test for 630 college
students and 90 DVR clients replicated the relationship reported by
Wright (1967) and shown in Figure 2. Tinsley and Dawis conclude that
". . . Rasch ability estimates are invariant with respect to the ability
of the calibrating sample" (p. 337).

Tinsley and Dawis also investigated the degree to which the item
parameters (item difficulty estimates and z-item difficulty ratios) were
invariant when the analyses were performed on all items of the test.
The correlation of item difficulty estimates for a given test from two
examinee groups tended to be rather large (+.90). Interestingly,
correlations close to zero were obtained from the DVR group with both high
school and college students. This unexpected finding may be attributed
to the small (n = 89) sample of DVR subjects. Generally, the item
easiness ratios were invariant with respect to the ability of the
calibrating sample of examinees, even though several of the comparisons used
samples of questionable size.

Evaluation.  The  studies  cited  have  demonstrated  that  if  the  assump- 
tions are  met,  or  even  reasonably  approximated,  then  person-free  test 
calibration  and  item-free  person  measurement  can  be  achieved  by  using 
this  one-parameter  logistic  model.  Although  Hambleton  and  Traub  (1973) 
report  that  a logistic  model  with  an  item  discrimination  index  as  a 
second  parameter  provides  a better  fit  to  their  data,  the  inclusion  of 
this second parameter violates true "objectivity in measurement" (Wright,
1967).

Several potential shortcomings may pose some difficulty in
successfully implementing the model: (1) a pool of items must be developed
that conforms to this item-analysis model, and the items must be
calibrated (perhaps 20% of the items will have to be either discarded or
revised); (2) the item calibration and standardization procedures
require dozens of items and hundreds of subjects; (3) the model does not
make direct predictions about optimal test lengths or cutting scores as
do the models of Macready and Novick and Lewis; and (4) the mathematics
of the model can become quite complex, posing problems for actually
implementing the model and for interpretation of output. However, recent
publications and the availability of computer programs (Wright & Mead,
1975, 1976) alleviate this difficulty.

The  major  virtues  of  the  Rasch  model  can  be  summarized  as  follows: 
(1)  Once  a test  has  been  standardized  on  any  group  of  subjects,  it  can 
be  given  again  to  a different  group,  without  the  need  to  create  parallel 
forms.  For  example,  a test  which  had  been  developed  by  giving  it  to 
"masters"  could  later  be  given  to  "nonmasters."  (2)  All  abilities  will 
be  on  the  same  scale,  regardless  of  the  subset  of  items  from  which  these 
abilities  were  estimated.  Thus,  person  A can  be  measured  on  a hard  test, 
and  person  B on  an  easy  test. 
Regression Theory


Assumptions  and  Rationale.  The  criterion-referenced  testing  litera- 
ture has  tended  to  emphasize  the  supposed  dichotomy  between  classical 
test  theory  and  the  emerging  CRT  theory.  The  following  discussion  of 
regression  as  a means  for  assessing  mastery  is  intended  to  point  out 
the  similarities  between  several  CRT  strategies  and  classical  theory. 
Specifically,  both  the  Bayesian  and  logistic  models  produce  estimated 
distributions  of  ability,  as  does  classical  regression.  A cutoff  score 
must  still  be  set  at  some  point  on  the  ability  (score)  distributions, 
regardless  of  what  model  is  used  to  derive  the  distributions.  This  sec- 
tion simply  portrays  classical  regression  theory  in  terms  of  CRT  theory. 

The regression-theoretic approach of the "classical testing model"
(Lord & Novick, 1968) describes the reason for lack of perfect
mastery-nonmastery observed scores in terms of specified or estimated errors of
measurement. The observed score is considered to be an unbiased
estimate of an examinee's true score. It is then possible to derive a
regression function that could be used to estimate true scores from
observed scores. The equation for the regression function is

$$R(T \mid X) = r_{xx'} X + (1 - r_{xx'})\, m_x \qquad (19)$$

where R(T|X) = the true score T given the observed score X, r_xx' = the
reliability of the test, and m_x = the mean of the observed scores.

The magnitude of several types of error may also be determined.
The error of measurement is the error involved when, for a randomly
selected examinee, we take the observed score as an estimate of the
true score. This can be expressed as E = X - T, and the random variable
E, taking on values of e, is called the error of measurement. The
standard deviation of this error of measurement, called the standard
error of measurement, can be expressed in terms of the standard
deviation of observed scores and the reliability of the test:

$$s_E = s_x \sqrt{1 - r_{xx'}} \qquad (20)$$

The difference between the linear regression estimate and the true
score itself is called the error of estimation, and is expressed
symbolically as

$$e = r_{xx'}(X - m_x) - (T - m_x) \qquad (21)$$

The standard deviation of these errors, called the standard error
of estimation, is expressed as

$$s_e = s_x \sqrt{r_{xx'}(1 - r_{xx'})} \qquad (22)$$
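
A minimal sketch of these quantities follows, using Equation 19 together with the standard classical-test-theory forms of the standard error of measurement and the standard error of estimation (Lord & Novick, 1968). The reliability, mean, and standard deviation are hypothetical values for a five-item test, and the function names are illustrative.

```python
from math import sqrt


def estimated_true_score(x, reliability, mean_x):
    """Equation 19: regression estimate of the true score from observed score x."""
    return reliability * x + (1.0 - reliability) * mean_x


def standard_error_of_measurement(sd_x, reliability):
    """Standard deviation of E = X - T."""
    return sd_x * sqrt(1.0 - reliability)


def standard_error_of_estimation(sd_x, reliability):
    """Standard deviation of the error made by the regression estimate."""
    return sd_x * sqrt(reliability * (1.0 - reliability))


reliability, mean_x, sd_x = 0.80, 3.0, 1.2  # hypothetical test statistics
for x in range(6):
    print(x, round(estimated_true_score(x, reliability, mean_x), 2))
print("s_E:", standard_error_of_measurement(sd_x, reliability))
print("s_e:", standard_error_of_estimation(sd_x, reliability))
```

Because the reliability is less than one, each estimate is pulled toward the group mean, and the standard error of estimation is the smaller of the two standard errors.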

Example. A graphic representation of the regression technique for
a five-item test is shown in Figure 3. For each observed score, an
estimated true score is obtained from R(T|X), and the standard error of
estimation s_e is calculated. A cutoff score based upon true scores may
then be specified. (In this example, a true score of 4 correct has
arbitrarily been chosen as the cutoff score.)

The output of the regression model, like that for the Rasch model,
is a set of distributions. The mean of each distribution is the value
for each R(T|X), and the common standard error for all of the
distributions is s_e. If the decision rule requires that all examinees be
classified as masters when the value of R(T|X) exceeds the criterion, and that
all other scores should lead to a nonmastery decision, then the
probability of misclassification can be calculated.

For persons with observed scores and estimated true scores below
the criterion value, the probability that such persons might be
misclassified as nonmasters is simply the proportion of the distribution
exceeding the criterion value. For persons with observed scores and
estimated true scores above the criterion, the probability that such
persons might be misclassified as masters is the proportion of the
distribution below the criterion.

These  probabilities  of  misclassification  are  represented  as  dotted 
and  crosshatched  areas,  respectively,  in  Figure  3.  If  we  assume  that 
the  error  of  estimation  is  normally  distributed,  then  the  probabilities 
can  be  readily  obtained  from  a table  of  normal  probabilities. 
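
Under the normality assumption, the calculation can be sketched as below. The cutoff of 4 follows the example in the text; the estimated true scores and the standard error of estimation are hypothetical, and the function names are illustrative.

```python
from math import erfc, sqrt


def normal_upper_tail(z):
    """1 - F(z) for the standard normal distribution."""
    return 0.5 * erfc(z / sqrt(2.0))


def misclassification_probability(estimate, se_estimation, cutoff):
    """Probability that the true score lies on the opposite side of the cutoff
    from the estimated true score, assuming normally distributed errors of
    estimation with standard deviation se_estimation."""
    z = abs(cutoff - estimate) / se_estimation
    return normal_upper_tail(z)


cutoff, se_estimation = 4.0, 0.48  # cutoff from the example; SE is hypothetical
for estimate in (3.0, 3.8, 4.6):   # hypothetical estimated true scores
    decision = "master" if estimate > cutoff else "nonmaster"
    p_error = misclassification_probability(estimate, se_estimation, cutoff)
    print(estimate, decision, round(p_error, 3))
```

For estimates below the cutoff the result is the chance of wrongly withholding mastery; for estimates above it, the chance of wrongly granting mastery. Estimates close to the cutoff carry error probabilities approaching .50.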

Two final comments are necessary. First, this procedure uses the
standard error of estimate, rather than the standard error of
measurement; s_e will always be smaller than s_E, since more information is used
in calculating the estimated true score with a regression function than
in estimating true score as the observed score. Thus, there is good
reason  to  use  the  estimated  true  scores  R(t|x)  in  any  analysis  of  test 
data.  Second,  the  assumption  of  normality  becomes  important  only  when 
calculating  misclassification  errors.  If  the  standard  error  of  estimate 
cannot  be  assumed  to  be  normally  distributed,  it  may  still  be  reported, 
and  may  prove  to  be  useful  in  obtaining  an  estimate  of  the  goodness  of 
the  test. 

Evaluation.  The  regression  theory  approach  is  not  a predictive 
model  in  the  sense  that  the  models  developed  by  Dayton  and  Macready, 
Emrick,  Millman,  and  Novick  are  predictive  of  desired  test  lengths  and 
optimal  cutoff  scores.  However,  the  regression  approach  does  give  prob- 
abilistic estimates  of  true  scores,  given  the  observed  scores.  The 
assumptions  of  normally  distributed  standard  errors  of  estimate  and  of 
equal  standard  errors  for  all  abilities  may  also  be  difficult  to  meet, 
although  such  departures  may  not  pose  a serious  problem.  And,  since 
this  is  a linear  regression  model,  it  is  assumed  that  the  regression 
of  true  scores  on  observed  scores  is  linear.  This  is  a generally  rea- 
sonable, though  perhaps  overly  simplistic,  assumption  to  make.  Because 
the  regression  model  has  been  used  for  many  years  longer  than  the  other 
models  reviewed  in  this  paper,  there  is  a greater  theoretical  and 
empirical literature to back it up than there is for the newer, less
established  models.  For  a more  technical  critique  of  the  use  of  re- 
gression models  for  estimating  true  scores  from  observed  scores,  see 
Appendix  B. 


SUMMARY  AND  CONCLUSIONS 


Nature  of  Performance  Acquisition 

Performance  acquisition  is  assumed  to  be  an  all-or-none  phenomenon, 
according  to  the  models  developed  by  Emrick  and  by  Dayton  and  Macready 
(see  Table  1) . Hence,  these  models  assume  that  error-free  test  per- 
formance is  also  dichotomous.  But  the  binomial,  Bayesian,  logistic, 
and  classical  regression  models  assume  that  performance  acquisition  is 
continuous.  Performance  on  dichotomously  scored  test  items  must  there- 
fore be  mapped  onto  an  equivalent  position  on  the  underlying  ability 
continuum  (Roudabush,  1974) . It  is  not  possible  to  decide  unequivo- 
cally that  one  assumption  is  more  correct  than  the  other,  since  the 
nature  of  performance  acquisition  most  likely  interacts  with  the  par- 
ticular type  of  task.  Some  tasks  tend  to  elicit  unitary,  highly  prac- 
ticed, sequential  behaviors,  and  would  seem  to  be  performed  in  an  all- 
or-none  fashion.  Tasks  which  require  multiskilled  performances  would 
more  closely  approximate  the  assumptions  of  the  continuous  skill 
acquisition  models. 


Measurement  Error 

Measurement error is defined as the difference between observed
test score and true (unobservable) score that would be obtained if
measurement were perfect. It is most important when one tries to infer a
true "error-free" score from observed data. The Block and Crehan methods
do  not  estimate  a true  score,  nor  do  they  deal  directly  with  measurement 
error.  Rather,  they  relate  observed  scores  directly  to  an  external  cri- 
terion. Hence,  any  systematic  error  will  not  be  a problem.  But  random 
errors  which  affect  the  consistency  of  observed  scores  will  disturb  the 
measurement  process  for  individual  cases.  Fortuitously,  such  errors 
will  tend  to  average  out  across  groups  of  examinees,  allowing  generali- 
zations to  be  made  which  should  be  valid  in  the  "long  run." 

The all-or-none models deal with measurement error by stipulating
values for the probability of masters committing errors and for
nonmasters guessing correctly. These values are obtained by fitting the
all-or-none models to observed data. Responses from both mastery and
nonmastery groups can be described by binomial distributions.

The "continuous" models of Novick, Rasch, and regression theory
deal with measurement error by reporting a standard error for each true
score estimate. In particular, the Rasch model provides a check on how
well the model's output approximates the observed score matrix (Wright
& Mead, 1975, 1976). "Best fit" techniques are required for the
Bayesian and regression models. The binomial models do not rely
directly on observed data, and hence do not deal directly with measurement
error. Instead, for any hypothesized level of mastery, the models
predict the observed score distribution. Adequacy of the models'
predictions can be evaluated by fitting data to the hypothesized distributions.
A more complete comparison of how these models are affected by
measurement error must await either Monte Carlo simulation studies or
considerable efforts of empirical research.


Classification  Error 


Unlike  measurement  error,  classification  error  refers  to  assigning 
individuals  to  inappropriate  mastery  level  groups — masters  to  the  non- 
mastery group,  and  nonmasters  to  the  mastery  level  group.  Such  errors 
could  occur  even  with  error-free  measurement.  However,  measurement 
error  interacts  with  classification  error,  further  complicating  the 
decisionmaking  process  of  assigning  examinees  to  mastery  level  groups. 
Suppose  that,  because  of  measurement  error,  all  estimates  of  true  score 
tended  to  be  inflated.  For  a given  decision  rule,  this  would  tend  to 
decrease  false  negatives  and  increase  false  positives.  Unfortunately, 
constant  measurement  error  is  the  exception  rather  than  the  rule,  making 
it  virtually  impossible  to  correct  for  it,  and  therefore  separate  it 
from  classification  error. 

The Block and Crehan models deal with classification error
empirically by comparing the decisions based on a test score with an external
criterion. Hence, the classification error can be determined simply by
counting the number of observed misclassifications. If examinee groups
remain similar over time, these models probably provide useful and stable
estimates of misclassification error.
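
The counting procedure can be sketched generically as follows. This is not the Block or Crehan procedure itself, only an illustration of tallying disagreements between a cutoff decision and an external criterion; the scores, criterion classifications, and cutoff are hypothetical.

```python
def classification_errors(scores, is_master, cutoff):
    """Count (false positives, false negatives) for a pass rule score >= cutoff.

    is_master holds the external criterion classification for each examinee.
    """
    false_positives = sum(
        1 for score, master in zip(scores, is_master) if score >= cutoff and not master
    )
    false_negatives = sum(
        1 for score, master in zip(scores, is_master) if score < cutoff and master
    )
    return false_positives, false_negatives


# Hypothetical test scores and external criterion classifications.
scores = [3, 5, 4, 2, 5, 1]
is_master = [True, False, True, False, True, False]
for cutoff in (3, 4, 5):
    print(cutoff, classification_errors(scores, is_master, cutoff))
```

Repeating the count over candidate cutoffs shows how the two kinds of error trade off as the cutoff moves.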

Because  none  of  the  other  models  incorporates  an  external  criterion, 
a direct  measure  of  classification  error  is  not  possible.  Instead,  the 
models  rely  on  the  distributional  information  obtained  for  the  estimated 
true  scores.  With  this  information,  it  is  possible  to  predict  the  prob- 
ability of  misclassification,  given  various  cutoff  scores.  Further  em- 
pirical work  which  incorporates  art  external  criterion  is  needed  to 
verify  the  accuracy  of  such  predictions. 

An essential ingredient of decisionmaking on the basis of CRT
scores is the concept of cost, both to the examinee and to the system
which he or she is being prepared to join. Consider the case of
professional licensing, such as for new medical doctors: with an extremely
strict criterion, many would fail, morale would be low, and the system
(society) would be deprived of much-needed medical service. However,
with a very lax criterion, more examinees would pass who may not
(unfortunately) be qualified, and society would thus suffer the
consequences of having "nonmasters" in practice. A similar case could be
made for automobile mechanics, military medics, television repairmen,
etc. Emrick's model is the only one that directly incorporates monetary
costs of incorrect classifications into its procedures. However, an
objective cost factor could also be incorporated into the other models
quite readily. But none of the models, as developed, deals with more
complex kinds of cost, such as morale, costs to society (which may have
to be measured in terms of utility, not dollars), or even the cost of
testing as opposed to not testing (Nader, 1976).
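
A simple way to see how an objective cost factor could enter any of the models is to weight each kind of misclassification by its cost and compare candidate cutoffs. The sketch below is a generic expected-cost calculation, not Emrick's formula; the error probabilities, the cost values, and all names are hypothetical.

```python
def expected_cost(p_false_positive, p_false_negative,
                  cost_false_positive, cost_false_negative):
    """Expected cost per examinee of the two kinds of misclassification."""
    return (p_false_positive * cost_false_positive
            + p_false_negative * cost_false_negative)

# Hypothetical error probabilities (false positive, false negative)
# for three candidate cutoff scores.
CUTOFFS = {
    "lax": (0.20, 0.02),
    "moderate": (0.08, 0.06),
    "strict": (0.02, 0.18),
}

def cheapest_cutoff(cost_false_positive, cost_false_negative):
    """Return the name of the candidate cutoff with the lowest expected cost."""
    return min(
        CUTOFFS,
        key=lambda name: expected_cost(*CUTOFFS[name],
                                       cost_false_positive,
                                       cost_false_negative),
    )

# Passing a nonmaster is five times as costly as failing a master ...
print(cheapest_cutoff(cost_false_positive=5, cost_false_negative=1))
# ... and the reverse situation.
print(cheapest_cutoff(cost_false_positive=1, cost_false_negative=5))
```

The preferred cutoff changes with the cost ratio, which is why the definition of cost matters as much as the error probabilities themselves.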


Test  Length 

For  performance-oriented  testing,  where  each  item  may  require  con- 
siderable time  and  expense,  it  is  essential  to  be  able  to  approximate 
the  minimum  number  of  items  needed  for  good  decisionmaking. 

Neither the Block nor the Crehan method explicitly deals with
test  length.  These  models  were  designed  to  show  what  happens  when 
existing  test  results  are  compared  to  an  external  criterion.  However, 
since  the  data  are  available,  it  would  be  possible  to  reevaluate  the 
results,  assuming  that  only  some  of  the  test  items  were  used.  The 
regression  approach  allows  for  shorter  tests,  but  does  not  provide 
for  extrapolation  to  longer  tests. 

Since  the  binomial  model  does  not  rely  on  observed  data,  results 
for  tests  of  any  length  can  be  predicted.  This  aspect  of  the  model  is 
particularly  attractive,  since  a first  approximation  to  test  length  can 
be  easily  tried  out. 
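
As a rough illustration of predicting results for tests of any length, the sketch below assumes the simplest binomial setup: every item is answered correctly with the same probability, independently of the other items. The function name, the two examinee proficiencies, and the pass rule of 80% of the items are assumptions made for the example, not values from this report.

```python
from math import comb

def pass_probability(n_items, min_correct, p_correct):
    """P(at least min_correct of n_items right) under a binomial model."""
    return sum(
        comb(n_items, x) * p_correct**x * (1 - p_correct)**(n_items - x)
        for x in range(min_correct, n_items + 1)
    )

# Hypothetical examinees: a "master" who answers 90% of items correctly
# and a "nonmaster" who answers 60% correctly.
P_MASTER, P_NONMASTER = 0.90, 0.60

# Hypothetical test lengths, each with a pass rule of 80% of the items.
for n_items, min_correct in [(5, 4), (10, 8), (20, 16)]:
    false_negative = 1 - pass_probability(n_items, min_correct, P_MASTER)
    false_positive = pass_probability(n_items, min_correct, P_NONMASTER)
    print(f"n={n_items:2d}  P(master fails)={false_negative:.3f}  "
          f"P(nonmaster passes)={false_positive:.3f}")
```

The loop prints both error probabilities for each test length, so a first approximation to an adequate length can be read off without collecting any data.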

The  all-or-none  models  use  observed  data  to  help  generate  the  neces- 
sary parameters.  Once  the  values  are  available,  it  is  possible  to  pre- 
dict the  results  for  tests  of  any  length.  As  in  the  Bayesian  model, 
such  predictions  will  be  valid  only  if  the  examinee  groups  remain  rela- 
tively stable. 


The  Bayesian  models  can  also  be  used  as  a predictor  for  test  re- 
sults of  any  test  length.  However,  estimates  of  the  values  of  several 
prior  probabilities  must  be  specified.  In  order  for  the  predicted 
results  to  be  applicable  to  real  data,  the  estimated  prior  probabili- 
ties must  be  close  approximations  to  the  priors  as  determined  post  hoc, 
after  data  have  been  collected.  The  main  feature  of  this  model — to 
reduce  test  length  as  a function  of  increasing  prior  information — will 
be  minimized  to  the  extent  that  the  prior  information  departs  from  cor- 


this  respect.  Since  the  item  difficulty  values  calculated  as  part  of 
the  procedure  are  invariant  across  examinee  groups  of  differing  ability, 
any  subset  of  items  can  be  used  with  any  group  of  examinees.  Further- 
more, the  errors  associated  with  each  calibrated  item  are  available, 
which  can  lead  to  precise  predictions  of  classification  error  for  tests 
made  up  of  a subset  of  the  original  item  pool. 
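
To make the idea of predicting classification error for a subset of calibrated items concrete, the sketch below uses the standard one-parameter logistic (Rasch) item response function and builds the number-correct distribution for an examinee of a given ability. The item difficulties, the ability value, and the cutoff are hypothetical, and the calibration errors of the items are ignored.

```python
from math import exp

def rasch_probability(ability, difficulty):
    """Rasch model: probability of a correct response to one item."""
    return 1.0 / (1.0 + exp(-(ability - difficulty)))

def score_distribution(ability, difficulties):
    """P(number correct = s), s = 0..n, for independent Rasch items."""
    dist = [1.0]  # with no items, a score of 0 is certain
    for difficulty in difficulties:
        p = rasch_probability(ability, difficulty)
        updated = [0.0] * (len(dist) + 1)
        for score, prob in enumerate(dist):
            updated[score] += prob * (1 - p)   # item answered wrong
            updated[score + 1] += prob * p     # item answered right
        dist = updated
    return dist

# Hypothetical subset of five calibrated items (difficulties in logits)
# and an examinee one logit above the scale origin.
subset = [-1.0, -0.5, 0.0, 0.5, 1.0]
dist = score_distribution(ability=1.0, difficulties=subset)

cutoff = 4  # hypothetical passing score on the five-item subset
p_pass = sum(dist[cutoff:])
print(f"P(score >= {cutoff}) = {p_pass:.3f}")
```

Because the difficulties are taken as fixed properties of the items, the same function can be reused for any other subset of the pool or any other ability level.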

Conceptualization  of  Mastery 

The  only  models  that  explicitly  define  mastery  are  the  all-or-none 
models.  Deviations  from  perfection  or  total  lack  of  ability  are  defined 
as  measurement  error.  Mastery  is  not  explicitly  defined  in  any  of  the 
other  models.  Either  test  performance  is  related  to  some  other  perform- 
ance (Block  and  Crehan)  or  an  estimated  true  score  on  a continuum  is 
provided. The models can then be used to evaluate test results for any
specified definition of mastery.

These  (continuous)  models  require  that  the  tester  be  extremely  sen- 
sitive to  system  requirements.  If  mastery  is  defined  in  terms  of  very 
high  performance,  then  very  few  examinees  are  likely  to  be  classified 
as  masters;  however,  if  mastery  is  defined  in  terms  of  less  demanding 
standards,  the  tester  (and  the  system)  runs  the  risk  of  having  a mas- 
tery group  that  is  less  than  adequate.  Thus,  the  validity  of  the  defi- 
nition of  mastery  in  terms  of  the  system  requirements  becomes  a crucial 
issue.  Empirical  studies  are  needed  in  specific  content  areas  to  deter- 
mine "how  much  ability"  a master  should  have. 


Characteristics  of  Items 

Only the Rasch logistic model, of all the models discussed in this
paper, is designed for item analysis. The other models treat such
matters as how items are sampled, item difficulty, item homogeneity,
and item independence as either assumptions or definitions. Certainly if
an item set can be shown to violate these assumptions or definitions,
the application of such a model would be questionable. Little theoretical
or empirical work has been done to demonstrate the robustness of
these models to violations of the assumptions.
APPENDIX A


A GENERALIZATION  OF  THE  EMRICK  MODEL  FOR  THE  CASE  OF 
UNEQUAL  PROPORTIONS  OF  MASTERS  AND  NONMASTERS 

Kenneth  I.  Epstein^ 


The phi coefficient is a legitimate measure of correlation for
data expressed as frequencies or proportions; it is not appropriate
for conditional probabilities. The entries in the table of measurement
errors proposed by Emrick and Adams (1970) and Emrick (1971a,
1971b) are conditional probabilities. A simple numerical example
illustrates the type of problem that may occur if conditional
probabilities are used to calculate $\phi$. Assume that a group of examinees
is made up of 80% masters and 20% nonmasters, that masters who respond
incorrectly to an item make up 10% of all examinees, and that nonmasters
who respond correctly to the item make up 5% of all examinees. This
situation is represented in a fourfold table in Table A-1.


Table A-1

Hypothetical Response Data for
Masters and Nonmasters

                 Observed response
True state       Wrong    Correct    Total
Master            .10       .70       .80
Nonmaster         .15       .05       .20
Total             .25       .75      1.00

The phi coefficient for Table A-1 is:

$$\phi = \frac{(.70)(.15) - (.10)(.05)}{\sqrt{(.80)(.20)(.25)(.75)}} = .5774$$

The  above  represents  a valid  use  of  the  phi  coefficient. 


^My  appreciation  to  Dr.  George  Macready  for  pointing  out  the  problem 
and  suggesting  the  direction  of  its  solution. 
We may now calculate $\alpha$ and $\beta$ for the above data. $\alpha$ is defined as
the probability that a nonmaster responds correctly. $\beta$ is defined as
the probability that a master responds incorrectly. For this example:

$$\alpha = .05/.20 = .250 \qquad 1 - \alpha = .750$$

$$\beta = .10/.80 = .125 \qquad 1 - \beta = .875$$

These  data  are  represented  in  Table  A-2. 

Table A-2

Measurement Errors and Mastery State
for Hypothetical Data

                 Observed response
True state       Wrong                  Correct                Total
Mastery          $\beta = .125$         $1 - \beta = .875$       1
Nonmastery       $1 - \alpha = .750$    $\alpha = .250$          1
Total            .875                   1.125                    2

The phi coefficient for Table A-2 is:

$$\phi = \frac{(.875)(.750) - (.125)(.250)}{\sqrt{(1)(1)(.875)(1.125)}} = .6299$$

Clearly the two calculated values of $\phi$ are not in agreement. Table
A-2 is the sort of analysis proposed by Emrick and Adams. It does not
represent a valid application of the phi coefficient.
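
The two calculations can be reproduced with a few lines of Python. The function below is only a sketch of the fourfold-table formula used in this appendix; its name and argument order are hypothetical, and the cell values are those of Tables A-1 and A-2.

```python
from math import sqrt

def phi_fourfold(master_wrong, master_correct,
                 nonmaster_wrong, nonmaster_correct):
    """Phi coefficient for a 2 x 2 table of true state by observed response."""
    numerator = (master_correct * nonmaster_wrong
                 - master_wrong * nonmaster_correct)
    row_master = master_wrong + master_correct
    row_nonmaster = nonmaster_wrong + nonmaster_correct
    col_wrong = master_wrong + nonmaster_wrong
    col_correct = master_correct + nonmaster_correct
    return numerator / sqrt(row_master * row_nonmaster * col_wrong * col_correct)

# Table A-1: proportions of all examinees (a valid use of phi).
phi_proportions = phi_fourfold(0.10, 0.70, 0.15, 0.05)

# Table A-2: conditional probabilities (not a valid use of phi).
phi_conditional = phi_fourfold(0.125, 0.875, 0.750, 0.250)

print(round(phi_proportions, 4), round(phi_conditional, 4))
```

The same function gives different values for the two tables because the rows of Table A-2 have each been rescaled to sum to 1, which discards the unequal proportions of masters and nonmasters.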

Fortunately,  one  can  obtain  a table  of  proportions  similar  to 
Table  A-l  from  a table  of  measurement  errors  similar  to  Table  A-2, 
simply  by  multiplying  each  entry  in  the  mastery  row  of  Table  A-2  by  the 
proportion  of  masters,  and  by  multiplying  each  entry  in  the  nonmastery 
row  of  Table  A-2  by  the  proportion  of  nonmasters.  The  general  form  for 
this  relationship  is  represented  in  Table  A-3. 


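Table A-3

Table of Proportions for Observed Responses
and Mastery State in Terms of $\alpha$, $\beta$, $P(M)$, and $P(\bar{M})$

                 Observed response
True state       Wrong                                   Correct                                 Total
Mastery          $P(M)\beta$                             $P(M)(1 - \beta)$                       $P(M)$
Nonmastery       $P(\bar{M})(1 - \alpha)$                $P(\bar{M})\alpha$                      $P(\bar{M})$
Total            $P(M)\beta + P(\bar{M})(1 - \alpha)$    $P(\bar{M})\alpha + P(M)(1 - \beta)$    1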


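The phi coefficient for Table A-3 is derived as follows:

$$
\begin{aligned}
\phi &= \frac{P(M)(1-\beta)\,P(\bar{M})(1-\alpha) - P(M)\beta\,P(\bar{M})\alpha}
{\sqrt{[P(M)\beta + P(\bar{M})(1-\alpha)]\,[P(\bar{M})\alpha + P(M)(1-\beta)]\,P(M)P(\bar{M})}} \\[6pt]
&= \frac{P(M)P(\bar{M})\,[(1-\beta)(1-\alpha) - \beta\alpha]}
{\sqrt{[P(M)\beta + P(\bar{M}) - P(\bar{M})\alpha]\,[P(\bar{M})\alpha + P(M) - P(M)\beta]\,P(M)P(\bar{M})}} \\[6pt]
&= \frac{P(M)P(\bar{M})\,[1-\alpha-\beta]}
{\sqrt{\bigl[P(M)P(\bar{M})\alpha\beta + P(M)^2\beta - P(M)^2\beta^2 + P(\bar{M})^2\alpha + P(M)P(\bar{M}) - P(M)P(\bar{M})\beta - P(\bar{M})^2\alpha^2 - P(M)P(\bar{M})\alpha + P(M)P(\bar{M})\alpha\beta\bigr]\,P(M)P(\bar{M})}} \\[6pt]
&= \frac{P(M)P(\bar{M})\,[1-\alpha-\beta]}
{\sqrt{[P(M)P(\bar{M})]^2\left[\alpha\beta + \frac{P(M)}{P(\bar{M})}\beta - \frac{P(M)}{P(\bar{M})}\beta^2 + \frac{P(\bar{M})}{P(M)}\alpha + 1 - \beta - \frac{P(\bar{M})}{P(M)}\alpha^2 - \alpha + \alpha\beta\right]}} \\[6pt]
&= \frac{1-\alpha-\beta}
{\sqrt{1-\alpha-\beta+2\alpha\beta + \frac{P(M)}{P(\bar{M})}(\beta-\beta^2) + \frac{P(\bar{M})}{P(M)}(\alpha-\alpha^2)}}
\end{aligned}
$$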


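Finally, we note that for the case where $P(M) = P(\bar{M})$, the formula
above reduces to the formula given by Emrick and Adams:

$$
\begin{aligned}
\phi &= \frac{1-\alpha-\beta}{\sqrt{1-\alpha-\beta+2\alpha\beta+\beta-\beta^2+\alpha-\alpha^2}} \\[6pt]
&= \frac{1-\alpha-\beta}{\sqrt{1-[\alpha^2-2\alpha\beta+\beta^2]}} \\[6pt]
&= \frac{1-\alpha-\beta}{\sqrt{1-(\alpha-\beta)^2}}
\end{aligned}
$$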
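
One way to check the generalization numerically is to build the cells of Table A-3 from the error rates and the proportion of masters, compute phi from those cells, and compare the result with the equal-proportions formula. The sketch below does this for the hypothetical data of Table A-1; the function names are assumptions introduced for the example.

```python
from math import sqrt

def phi_from_errors(alpha, beta, p_master):
    """Phi computed from the Table A-3 cells for given error rates and P(M)."""
    p_nonmaster = 1 - p_master
    master_wrong = p_master * beta
    master_correct = p_master * (1 - beta)
    nonmaster_wrong = p_nonmaster * (1 - alpha)
    nonmaster_correct = p_nonmaster * alpha
    numerator = (master_correct * nonmaster_wrong
                 - master_wrong * nonmaster_correct)
    denominator = sqrt(p_master * p_nonmaster
                       * (master_wrong + nonmaster_wrong)
                       * (master_correct + nonmaster_correct))
    return numerator / denominator

def phi_equal_proportions(alpha, beta):
    """Equal-proportions special case: (1 - a - b) / sqrt(1 - (a - b)^2)."""
    return (1 - alpha - beta) / sqrt(1 - (alpha - beta) ** 2)

# Error rates from the hypothetical example: alpha = .250, beta = .125.
unequal = phi_from_errors(0.250, 0.125, p_master=0.80)
equal = phi_from_errors(0.250, 0.125, p_master=0.50)

print(round(unequal, 4), round(equal, 4),
      round(phi_equal_proportions(0.250, 0.125), 4))
```

With 80% masters the function returns the value obtained from Table A-1, and with equal proportions it agrees with the special-case formula, which is also the value obtained from Table A-2.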

For the example cited in the text,

$$\phi = \frac{1 - .06 - .12}{\sqrt{1 - .0036}} = \frac{.82}{.998} = .822.$$

If we have a three-item test, upon substituting into equation (3)
we obtain

$$k = \frac{\log\dfrac{.12}{1 - .06} + 0}{\log\left(\dfrac{.06 \times .12}{(1 - .06)(1 - .12)}\right)} = \frac{\log .128 + 0}{\log .0087} = .4339.$$
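
The value of k can be verified directly. The sketch below evaluates only the expression as printed in this example, with the additive term in the numerator taken as 0; it is not a general implementation of equation (3), and the variable names are hypothetical.

```python
from math import log

alpha = 0.06  # probability that a nonmaster responds correctly
beta = 0.12   # probability that a master responds incorrectly

# Additive numerator term, which equals 0 in the printed example.
additive_term = 0.0

numerator = log(beta / (1 - alpha)) + additive_term
denominator = log((alpha * beta) / ((1 - alpha) * (1 - beta)))
k = numerator / denominator

print(round(k, 4))
```

The ratio does not depend on the base of the logarithm, so natural logarithms give the same k as common logarithms.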
APPENDIX B


CRITIQUE  OF  THE  SIMPLIFYING  ASSUMPTIONS  IN  USING 
REGRESSION  MODELS  FOR  ESTIMATING  TRUE  SCORES 
FROM  OBSERVED  SCORES 

James  McBride 
Army  Research  Institute 

Since $R(T|x)$ is not an unbiased estimator of $T$, the standard deviation
of the error of estimate $e$ is not the same as the conditional
standard deviation of the true score for a given observed score. That
is, if $e$ is an error of estimate $(\hat{T} - T)$, then
$\sigma^2(e|x) = \sigma^2(T|x) + \text{bias}^2$.
Here, $\sigma^2(T|x)$ is the conditional variance of the true scores for given
observed scores, which is the distribution portrayed in Figure 3 and used
for inference to the misclassification probabilities.

However, $\sigma^2(e|x)$ (or equivalently, $\sigma^2(e)$) is then not the
appropriate variance unless there is no bias; that is, unless
$E(\hat{T}|T) = T$. And this latter relationship is generally not the case.
Estimation of classification error probabilities using $\sigma^2(e)$ as the
conditional variance would therefore be inappropriate.
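
The decomposition can be illustrated with a small numerical example. The sketch below assumes a made-up discrete conditional distribution of true scores for one observed score and a deliberately biased point estimate; it reads the quantity on the left of the relation above as the mean squared error of estimate. All values and names are hypothetical.

```python
# Hypothetical conditional distribution of true scores T for one observed x.
true_scores = [4.0, 5.0, 6.0, 7.0]
probabilities = [0.1, 0.4, 0.4, 0.1]

# Hypothetical regression estimate of T for this x (deliberately biased).
estimate = 5.0

# Conditional mean and variance of T given x.
mean_t = sum(p * t for t, p in zip(true_scores, probabilities))
var_t = sum(p * (t - mean_t) ** 2 for t, p in zip(true_scores, probabilities))

# Bias of the estimate and mean squared error of estimate.
bias = estimate - mean_t
mse = sum(p * (estimate - t) ** 2 for t, p in zip(true_scores, probabilities))

print(f"var(T|x) = {var_t:.2f}, bias^2 = {bias ** 2:.2f}, MSE = {mse:.2f}")
```

Here the error of estimate overstates the spread of true scores by the squared bias, so using it as the conditional variance would misstate the misclassification probabilities.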

Linear regression of $T$ on $x$ is a convenient simplifying assumption;
but in actuality, the regression may often be nonlinear. Also, the
distribution of errors may seldom be normal, or even symmetrical; the
same holds true for the conditional distribution of $T$. In sum, the
estimation of error probabilities from simplified linear regression
models may be considerably distorted by the above complicating
factors.